aboutsummaryrefslogtreecommitdiffstats
path: root/fs/nfs/direct.c
AgeCommit message (Collapse)AuthorFilesLines
2024-03-16Merge tag 'nfs-for-6.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-3/+15
Pull NFS client updates from Trond Myklebust: "Highlights include: Bugfixes: - Fix for an Oops in the NFSv4.2 listxattr handler - Correct an incorrect buffer size in listxattr - Fix for an Oops in the pNFS flexfiles layout - Fix a refcount leak in NFS O_DIRECT writes - Fix missing locking in NFS O_DIRECT - Avoid an infinite loop in pnfs_update_layout - Fix an overflow in the RPC waitqueue queue length counter - Ensure that pNFS I/O is also protected by TLS when xprtsec is specified by the mount options - Fix a leaked folio lock in the netfs read code - Fix a potential deadlock in fscache - Allow setting the fscache uniquifier in NFSv4 - Fix an off by one in root_nfs_cat() - Fix another off by one in rpc_sockaddr2uaddr() - nfs4_do_open() can incorrectly trigger state recovery - Various fixes for connection shutdown Features and cleanups: - Ensure that containers only see their own RPC and NFS stats - Enable nconnect for RDMA - Remove dead code from nfs_writepage_locked() - Various tracepoint additions to track EXCHANGE_ID, GETDEVICEINFO, and mount options" * tag 'nfs-for-6.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (29 commits) nfs: fix panic when nfs4_ff_layout_prepare_ds() fails NFS: trace the uniquifier of fscache NFS: Read unlock folio on nfs_page_create_from_folio() error NFS: remove unused variable nfs_rpcstat nfs: fix UAF in direct writes nfs: properly protect nfs_direct_req fields NFS: enable nconnect for RDMA NFSv4: nfs4_do_open() is incorrectly triggering state recovery NFS: avoid infinite loop in pnfs_update_layout. NFS: remove sync_mode test from nfs_writepage_locked() NFSv4.1/pnfs: fix NFS with TLS in pnfs NFS: Fix an off by one in root_nfs_cat() nfs: make the rpc_stat per net namespace nfs: expose /proc/net/sunrpc/nfs in net namespaces sunrpc: add a struct rpc_stats arg to rpc_create_args nfs: remove unused NFS_CALL macro NFSv4.1: add tracepoint to trunked nfs4_exchange_id calls NFS: Fix nfs_netfs_issue_read() xarray locking for writeback interrupt SUNRPC: increase size of rpc_wait_queue.qlen from unsigned short to unsigned int nfs: fix regression in handling of fsc= option in NFSv4 ...
2024-03-12mm, slab: remove last vestiges of SLAB_MEM_SPREADLinus Torvalds1-2/+1
Yes, yes, I know the slab people were planning on going slow and letting every subsystem fight this thing on their own. But let's just rip off the band-aid and get it over and done with. I don't want to see a number of unnecessary pull requests just to get rid of a flag that no longer has any meaning. This was mainly done with a couple of 'sed' scripts and then some manual cleanup of the end result. Link: https://lore.kernel.org/all/CAHk-=wji0u+OOtmAOD-5JV3SXcRJF___k_+8XNKmak0yd5vW1Q@mail.gmail.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-09nfs: fix UAF in direct writesJosef Bacik1-2/+9
In production we have been hitting the following warning consistently ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 17 PID: 1800359 at lib/refcount.c:28 refcount_warn_saturate+0x9c/0xe0 Workqueue: nfsiod nfs_direct_write_schedule_work [nfs] RIP: 0010:refcount_warn_saturate+0x9c/0xe0 PKRU: 55555554 Call Trace: <TASK> ? __warn+0x9f/0x130 ? refcount_warn_saturate+0x9c/0xe0 ? report_bug+0xcc/0x150 ? handle_bug+0x3d/0x70 ? exc_invalid_op+0x16/0x40 ? asm_exc_invalid_op+0x16/0x20 ? refcount_warn_saturate+0x9c/0xe0 nfs_direct_write_schedule_work+0x237/0x250 [nfs] process_one_work+0x12f/0x4a0 worker_thread+0x14e/0x3b0 ? ZSTD_getCParams_internal+0x220/0x220 kthread+0xdc/0x120 ? __btf_name_valid+0xa0/0xa0 ret_from_fork+0x1f/0x30 This is because we're completing the nfs_direct_request twice in a row. The source of this is when we have our commit requests to submit, we process them and send them off, and then in the completion path for the commit requests we have if (nfs_commit_end(cinfo.mds)) nfs_direct_write_complete(dreq); However since we're submitting asynchronous requests we sometimes have one that completes before we submit the next one, so we end up calling complete on the nfs_direct_request twice. The only other place we use nfs_generic_commit_list() is in __nfs_commit_inode, which wraps this call in a nfs_commit_begin(); nfs_commit_end(); Which is a common pattern for this style of completion handling, one that is also repeated in the direct code with get_dreq()/put_dreq() calls around where we process events as well as in the completion paths. Fix this by using the same pattern for the commit requests. Before with my 200 node rocksdb stress running this warning would pop every 10ish minutes. With my patch the stress test has been running for several hours without popping. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-03-09nfs: properly protect nfs_direct_req fieldsJosef Bacik1-1/+6
We protect accesses to the nfs_direct_req fields with the dreq->lock ever where except nfs_direct_commit_complete. This isn't a huge deal, but it does lead to confusion, and we could potentially end up setting NFS_ODIRECT_RESCHED_WRITES in one thread where we've had an error in another. Clean this up to properly protect ->error and ->flags in the commit completion path. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-01-04NFS: drop unused nfs_direct_req bytes_leftBenjamin Coddington1-4/+2
Now that we're calculating how large a remaining IO should be based on the current request's offset, we no longer need to track bytes_left on each struct nfs_direct_req. Drop the field, and clean up the direct request tracepoints. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-01-04pNFS: Fix the pnfs block driver's calculation of layoutget sizeTrond Myklebust1-2/+3
Instead of relying on the value of the 'bytes_left' field, we should calculate the layout size based on the offset of the request that is being written out. Reported-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Fixes: 954998b60caa ("NFS: Fix error handling for O_DIRECT write scheduling") Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13NFS: More fixes for nfs_direct_write_reschedule_io()Trond Myklebust1-6/+11
Ensure that all requests are put back onto the commit list so that they can be rescheduled. Fixes: 4daaeba93822 ("NFS: Fix nfs_direct_write_reschedule_io()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13NFS: Use the correct commit info in nfs_join_page_group()Trond Myklebust1-3/+5
Ensure that nfs_clear_request_commit() updates the correct counters when it removes them from the commit list. Fixes: ed5d588fe47f ("NFS: Try to join page groups before an O_DIRECT retransmission") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13NFS: More O_DIRECT accounting fixes for error pathsTrond Myklebust1-16/+31
If we hit a fatal error when retransmitting, we do need to record the removal of the request from the count of written bytes. Fixes: 031d73ed768a ("NFS: Fix O_DIRECT accounting of number of bytes read/written") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13NFS: Fix O_DIRECT locking issuesTrond Myklebust1-4/+4
The dreq fields are protected by the dreq->lock. Fixes: 954998b60caa ("NFS: Fix error handling for O_DIRECT write scheduling") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-11NFS: Fix error handling for O_DIRECT write schedulingTrond Myklebust1-16/+46
If we fail to schedule a request for transmission, there are 2 possibilities: 1) Either we hit a fatal error, and we just want to drop the remaining requests on the floor. 2) We were asked to try again, in which case we should allow the outstanding RPC calls to complete, so that we can recoalesce requests and try again. Fixes: d600ad1f2bdb ("NFS41: pop some layoutget errors to application") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-24NFS: Fix a potential data corruptionTrond Myklebust1-1/+19
We must ensure that the subrequests are joined back into the head before we can retransmit a request. If the head was not on the commit lists, because the server wrote it synchronously, we still need to add it back to the retransmission list. Add a call that mirrors the effect of nfs_cancel_remove_inode() for O_DIRECT. Fixes: ed5d588fe47f ("NFS: Try to join page groups before an O_DIRECT retransmission") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-17NFS: Fix a use after free in nfs_direct_join_group()Trond Myklebust1-10/+16
Be more careful when tearing down the subrequests of an O_DIRECT write as part of a retransmission. Reported-by: Chris Mason <clm@fb.com> Fixes: ed5d588fe47f ("NFS: Try to join page groups before an O_DIRECT retransmission") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-02-14NFS: Clean up O_DIRECT request allocationTrond Myklebust1-8/+4
Rather than adjusting the index+offset after the call to nfs_create_request(), add a function nfs_page_create_from_page() that takes an offset. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-08-10Merge tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-35/+15
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - pNFS/flexfiles: Fix infinite looping when the RDMA connection errors out Bugfixes: - NFS: fix port value parsing - SUNRPC: Reinitialise the backchannel request buffers before reuse - SUNRPC: fix expiry of auth creds - NFSv4: Fix races in the legacy idmapper upcall - NFS: O_DIRECT fixes from Jeff Layton - NFSv4.1: Fix OP_SEQUENCE error handling - SUNRPC: Fix an RPC/RDMA performance regression - NFS: Fix case insensitive renames - NFSv4/pnfs: Fix a use-after-free bug in open - NFSv4.1: RECLAIM_COMPLETE must handle EACCES Features: - NFSv4.1: session trunking enhancements - NFSv4.2: READ_PLUS performance optimisations - NFS: relax the rules for rsize/wsize mount options - NFS: don't unhash dentry during unlink/rename - SUNRPC: Fail faster on bad verifier - NFS/SUNRPC: Various tracing improvements" * tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (46 commits) NFS: Improve readpage/writepage tracing NFS: Improve O_DIRECT tracing NFS: Improve write error tracing NFS: don't unhash dentry during unlink/rename NFSv4/pnfs: Fix a use-after-free bug in open NFS: nfs_async_write_reschedule_io must not recurse into the writeback code SUNRPC: Don't reuse bvec on retransmission of the request SUNRPC: Reinitialise the backchannel request buffers before reuse NFSv4.1: RECLAIM_COMPLETE must handle EACCES NFSv4.1 probe offline transports for trunking on session creation SUNRPC create a function that probes only offline transports SUNRPC export xprt_iter_rewind function SUNRPC restructure rpc_clnt_setup_test_and_add_xprt NFSv4.1 remove xprt from xprt_switch if session trunking test fails SUNRPC create an rpc function that allows xprt removal from rpc_clnt SUNRPC enable back offline transports in trunking discovery SUNRPC create an iterator to list only OFFLINE xprts NFSv4.1 offline trunkable transports on DESTROY_SESSION SUNRPC add function to offline remove trunkable transports SUNRPC expose functions for offline remote xprt functionality ...
2022-08-08iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()Al Viro1-4/+2
Most of the users immediately follow successful iov_iter_get_pages() with advancing by the amount it had returned. Provide inline wrappers doing that, convert trivial open-coded uses of those. BTW, iov_iter_get_pages() never returns more than it had been asked to; such checks in cifs ought to be removed someday... Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08new iov_iter flavour - ITER_UBUFAl Viro1-1/+1
Equivalent of single-segment iovec. Initialized by iov_iter_ubuf(), checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC ones. We are going to expose the things like ->write_iter() et.al. to those in subsequent commits. New predicate (user_backed_iter()) that is true for ITER_IOVEC and ITER_UBUF; places like direct-IO handling should use that for checking that pages we modify after getting them from iov_iter_get_pages() would need to be dirtied. DO NOT assume that replacing iter_is_iovec() with user_backed_iter() will solve all problems - there's code that uses iter_is_iovec() to decide how to poke around in iov_iter guts and for that the predicate replacement obviously won't suffice. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-23nfs: only issue commit in DIO codepath if we have uncommitted dataJeff Layton1-1/+1
Currently, we try to determine whether to issue a commit based on nfs_write_need_commit which looks at the current verifier. In the case where we got a short write and then tried to follow it up with one that failed, the verifier can't be trusted. What we really want to know is whether the pgio request had any successful writes that came back as UNSTABLE. Add a new flag to the pgio request, and use that to indicate that we've had a successful unstable write. Only issue a commit if that flag is set. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-23nfs: always check dreq->error after a commitJeff Layton1-1/+2
When the client gets back a short DIO write, it will then attempt to issue another write to finish the DIO request. If that write then fails (as is often the case in an -ENOSPC situation), then we still may need to issue a COMMIT if the earlier short write was unstable. If that COMMIT then succeeds, then we don't want the client to reschedule the write requests, and to instead just return a short write. Otherwise, we can end up looping over the same DIO write forever. Always consult dreq->error after a successful RPC, even when the flag state is not NFS_ODIRECT_DONE. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2028370 Reported-by: Boyang Xue <bxue@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-23nfs: add new nfs_direct_req tracepoint eventsJeff Layton1-33/+12
Add some new tracepoints to the DIO write code. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-05-09nfs: rename nfs_direct_IO and use as ->swap_rwNeilBrown1-13/+10
The nfs_direct_IO() exists to support SWAP IO, but hasn't worked for a while. We now need a ->swap_rw function which behaves slightly differently, returning zero for success rather than a byte count. So modify nfs_direct_IO accordingly, rename it, and use it as the ->swap_rw function. Link: https://lkml.kernel.org/r/165119301493.15698.7491285551903597618.stgit@noble.brown Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> (on Renesas RSK+RZA1 with 32 MiB of SDRAM) Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-03-13NFS: swap-out must always use STABLE writes.NeilBrown1-4/+6
The commit handling code is not safe against memory-pressure deadlocks when writing to swap. In particular, nfs_commitdata_alloc() blocks indefinitely waiting for memory, and this can consume all available workqueue threads. swap-out most likely uses STABLE writes anyway as COND_STABLE indicates that a stable write should be used if the write fits in a single request, and it normally does. However if we ever swap with a small wsize, or gather unusually large numbers of pages for a single write, this might change. For safety, make it explicit in the code that direct writes used for swap must always use FLUSH_STABLE. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13NFS: swap IO handling is slightly different for O_DIRECT IONeilBrown1-14/+28
1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding possible deadlocks with "fs_reclaim". These deadlocks could, I believe, eventuate if a buffered read on the swapfile was attempted. We don't need coherence with the page cache for a swap file, and buffered writes are forbidden anyway. There is no other need for i_rwsem during direct IO. So never take it for swap_rw() 2/ generic_write_checks() explicitly forbids writes to swap, and performs checks that are not needed for swap. So bypass it for swap_rw(). Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-01-10nfs: Convert to new fscache volume/cookie APIDave Wysochanski1-0/+2
Change the nfs filesystem to support fscache's indexing rewrite and reenable caching in nfs. The following changes have been made: (1) The fscache_netfs struct is no more, and there's no need to register the filesystem as a whole. (2) The session cookie is now an fscache_volume cookie, allocated with fscache_acquire_volume(). That takes three parameters: a string representing the "volume" in the index, a string naming the cache to use (or NULL) and a u64 that conveys coherency metadata for the volume. For nfs, I've made it render the volume name string as: "nfs,<ver>,<family>,<address>,<port>,<fsidH>,<fsidL>*<,param>[,<uniq>]" (3) The fscache_cookie_def is no more and needed information is passed directly to fscache_acquire_cookie(). The cache no longer calls back into the filesystem, but rather metadata changes are indicated at other times. fscache_acquire_cookie() is passed the same keying and coherency information as before. (4) fscache_enable/disable_cookie() have been removed. Call fscache_use_cookie() and fscache_unuse_cookie() when a file is opened or closed to prevent a cache file from being culled and to keep resources to hand that are needed to do I/O. If a file is opened for writing, we invalidate it with FSCACHE_INVAL_DIO_WRITE in lieu of doing writeback to the cache, thereby making it cease caching until all currently open files are closed. This should give the same behaviour as the uptream code. Making the cache store local modifications isn't straightforward for NFS, so that's left for future patches. (5) fscache_invalidate() now needs to be given uptodate auxiliary data and a file size. It also takes a flag to indicate if this was due to a DIO write. (6) Call nfs_fscache_invalidate() with FSCACHE_INVAL_DIO_WRITE on a file to which a DIO write is made. (7) Call fscache_note_page_release() from nfs_release_page(). (8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for PG_fscache to be cleared. (9) The functions to read and write data to/from the cache are stubbed out pending a conversion to use netfslib. Changes ======= ver #3: - Added missing =n fallback for nfs_fscache_release_file()[1][2]. ver #2: - Use gfpflags_allow_blocking() rather than using flag directly. - fscache_acquire_volume() now returns errors. - Remove NFS_INO_FSCACHE as it's no longer used. - Need to unuse a cookie on file-release, not inode-clear. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Co-developed-by: David Howells <dhowells@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Acked-by: Jeff Layton <jlayton@kernel.org> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: Anna Schumaker <anna.schumaker@netapp.com> cc: linux-nfs@vger.kernel.org cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/202112100804.nksO8K4u-lkp@intel.com/ [1] Link: https://lore.kernel.org/r/202112100957.2oEDT20W-lkp@intel.com/ [2] Link: https://lore.kernel.org/r/163819668938.215744.14448852181937731615.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906979003.143852.2601189243864854724.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967182112.1823006.7791504655391213379.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021575950.640689.12069642327533368467.stgit@warthog.procyon.org.uk/ # v4
2021-11-10Merge tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-1/+1
Pull NFS client updates from Trond Myklebust: "Highlights include: Features: - NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN - Optimisations for READDIR and the 'ls -l' style workload - Further replacements of dprintk() with tracepoints and other tracing improvements - Ensure we re-probe NFSv4 server capabilities when the user does a "mount -o remount" Bugfixes: - Fix an Oops in pnfs_mark_request_commit() - Fix up deadlocks in the commit code - Fix regressions in NFSv2/v3 attribute revalidation due to the change_attr_type optimisations - Fix some dentry verifier races - Fix some missing dentry verifier settings - Fix a performance regression in nfs_set_open_stateid_locked() - SUNRPC was sending multiple SYN calls when re-establishing a TCP connection. - Fix multiple NFSv4 issues due to missing sanity checking of server return values - Fix a potential Oops when FREE_STATEID races with an unmount Cleanups: - Clean up the labelled NFS code - Remove unused header <linux/pnfs_osd_xdr.h>" * tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (84 commits) NFSv4: Sanity check the parameters in nfs41_update_target_slotid() NFS: Remove the nfs4_label argument from decode_getattr_*() functions NFS: Remove the nfs4_label argument from nfs_setsecurity NFS: Remove the nfs4_label argument from nfs_fhget() NFS: Remove the nfs4_label argument from nfs_add_or_obtain() NFS: Remove the nfs4_label argument from nfs_instantiate() NFS: Remove the nfs4_label from the nfs_setattrres NFS: Remove the nfs4_label from the nfs4_getattr_res NFS: Remove the f_label from the nfs4_opendata and nfs_openres NFS: Remove the nfs4_label from the nfs4_lookupp_res struct NFS: Remove the label from the nfs4_lookup_res struct NFS: Remove the nfs4_label from the nfs4_link_res struct NFS: Remove the nfs4_label from the nfs4_create_res struct NFS: Remove the nfs4_label from the nfs_entry struct NFS: Create a new nfs_alloc_fattr_with_label() function NFS: Always initialise fattr->label in nfs_fattr_alloc() NFSv4.2: alloc_file_pseudo() takes an open flag, not an f_mode NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open() NFSv4: Remove unnecessary 'minor version' check NFSv4: Fix potential Oops in decode_op_map() ...
2021-10-25fs: get rid of the res2 iocb->ki_complete argumentJens Axboe1-1/+1
The second argument was only used by the USB gadget code, yet everyone pays the overhead of passing a zero to be passed into aio, where it ends up being part of the aio res2 value. Now that everybody is passing in zero, kill off the extra argument. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20NFS: Fix up commit deadlocksTrond Myklebust1-1/+1
If O_DIRECT bumps the commit_info rpcs_out field, then that could lead to fsync() hangs. The fix is to ensure that O_DIRECT calls nfs_commit_end(). Fixes: 723c921e7dfc ("sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-06-13NFSv4: Fix an Oops in pnfs_mark_request_commit() when doing O_DIRECTTrond Myklebust1-10/+7
Fix an Oopsable condition in pnfs_mark_request_commit() when we're putting a set of writes on the commit list to reschedule them after a failed pNFS attempt. Fixes: 9c455a8c1e14 ("NFS/pNFS: Clean up pNFS commit operations") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-08-15Merge tag 'nfs-for-5.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-1/+1
Pull NFS client updates from Trond Myklebust: "Stable fixes: - pNFS: Don't return layout segments that are being used for I/O - pNFS: Don't move layout segments off the active list when being used for I/O Features: - NFS: Add support for user xattrs through the NFSv4.2 protocol - NFS: Allow applications to speed up readdir+statx() using AT_STATX_DONT_SYNC - NFSv4.0 allow nconnect for v4.0 Bugfixes and cleanups: - nfs: ensure correct writeback errors are returned on close() - nfs: nfs_file_write() should check for writeback errors - nfs: Fix getxattr kernel panic and memory overflow - NFS: Fix the pNFS/flexfiles mirrored read failover code - SUNRPC: dont update timeout value on connection reset - freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS - sunrpc: destroy rpc_inode_cachep after unregister_filesystem" * tag 'nfs-for-5.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (32 commits) NFS: Fix flexfiles read failover fs: nfs: delete repeated words in comments rpc_pipefs: convert comma to semicolon nfs: Fix getxattr kernel panic and memory overflow NFS: Don't return layout segments that are in use NFS: Don't move layouts to plh_return_segs list while in use NFS: Add layout segment info to pnfs read/write/commit tracepoints NFS: Add tracepoints for layouterror and layoutstats. NFS: Report the stateid + status in trace_nfs4_layoutreturn_on_close() SUNRPC dont update timeout value on connection reset nfs: nfs_file_write() should check for writeback errors nfs: ensure correct writeback errors are returned on close() NFSv4.2: xattr cache: get rid of cache discard work queue NFS: remove redundant initialization of variable result NFSv4.0 allow nconnect for v4.0 freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS sunrpc: destroy rpc_inode_cachep after unregister_filesystem NFSv4.2: add client side xattr caching. NFSv4.2: hook in the user extended attribute handlers NFSv4.2: add the extended attribute proc functions. ...
2020-07-28NFS: remove redundant initialization of variable resultColin Ian King1-1/+1
The variable result is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-07-17SUNRPC reverting d03727b248d0 ("NFSv4 fix CLOSE not waiting for direct IO ↵Olga Kornievskaia1-9/+4
compeletion") Reverting commit d03727b248d0 "NFSv4 fix CLOSE not waiting for direct IO compeletion". This patch made it so that fput() by calling inode_dio_done() in nfs_file_release() would wait uninterruptably for any outstanding directIO to the file (but that wait on IO should be killable). The problem the patch was also trying to address was REMOVE returning ERR_ACCESS because the file is still opened, is supposed to be resolved by server returning ERR_FILE_OPEN and not ERR_ACCESS. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-26NFSv4 fix CLOSE not waiting for direct IO compeletionOlga Kornievskaia1-4/+9
Figuring out the root case for the REMOVE/CLOSE race and suggesting the solution was done by Neil Brown. Currently what happens is that direct IO calls hold a reference on the open context which is decremented as an asynchronous task in the nfs_direct_complete(). Before reference is decremented, control is returned to the application which is free to close the file. When close is being processed, it decrements its reference on the open_context but since directIO still holds one, it doesn't sent a close on the wire. It returns control to the application which is free to do other operations. For instance, it can delete a file. Direct IO is finally releasing its reference and triggering an asynchronous close. Which races with the REMOVE. On the server, REMOVE can be processed before the CLOSE, failing the REMOVE with EACCES as the file is still opened. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Suggested-by: Neil Brown <neilb@suse.com> CC: stable@vger.kernel.org Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-11NFS: Fix direct WRITE throughput regressionChuck Lever1-0/+2
I measured a 50% throughput regression for large direct writes. The observed on-the-wire behavior is that the client sends every NFS WRITE twice: once as an UNSTABLE WRITE plus a COMMIT, and once as a FILE_SYNC WRITE. This is because the nfs_write_match_verf() check in nfs_direct_commit_complete() fails for every WRITE. Buffered writes use nfs_write_completion(), which sets req->wb_verf correctly. Direct writes use nfs_direct_write_completion(), which does not set req->wb_verf at all. This leaves req->wb_verf set to all zeroes for every direct WRITE, and thus nfs_direct_commit_completion() always sets NFS_ODIRECT_RESCHED_WRITES. This fix appears to restore nearly all of the lost performance. Fixes: 1f28476dcb98 ("NFS: Fix O_DIRECT commit verifier handling") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-11NFS: remove redundant initialization of variable resultColin Ian King1-1/+1
The variable result is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-04-01NFS: Try to join page groups before an O_DIRECT retransmissionTrond Myklebust1-0/+20
If we have to retransmit requests, try to join their page groups first. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-01NFS: Fix a request reference leak in nfs_direct_write_clear_reqs()Trond Myklebust1-0/+1
nfs_direct_write_scan_commit_list() will lock the request and bump the reference count, but we also need to account for the reference that was taken when we initially added the request to the commit list. Fixes: fb5f7f20cdb9 ("NFS: commit errors should be fatal") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-27NFS/pNFS: Clean up pNFS commit operationsTrond Myklebust1-4/+2
Move the pNFS commit related operations into a separate structure that can be carried by the pnfs_ds_commit_info. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-27NFS: Remove bucket array from struct pnfs_ds_commit_infoTrond Myklebust1-1/+0
Remove the unused bucket array in struct pnfs_ds_commit_info. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-27NFS: Fix O_DIRECT commit verifier handlingTrond Myklebust1-122/+13
Instead of trying to save the commit verifiers and checking them against previous writes, adopt the same strategy as for buffered writes, of just checking the verifiers at commit time. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-27NFS: commit errors should be fatalTrond Myklebust1-2/+30
Fix the O_DIRECT code to avoid retries if the COMMIT fails with a fatal error. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-27NFS/pNFS: Allow O_DIRECT to release the DS commitinfoTrond Myklebust1-0/+1
Add a pNFS callback to allow the O_DIRECT code to release the DS commitinfo when freeing the dreq. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-27NFSv4/pnfs: Support a list of commit arrays in struct pnfs_ds_commit_infoTrond Myklebust1-0/+1
When we have multiple layout segments with different lists of mirrored data, we need to track the commits on a per layout segment basis. This patch adds a list to support this tracking in struct pnfs_ds_commit_info. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-22NFS: direct.c: Fix memory leak of dreq when nfs_get_lock_context failsMisono Tomohiro1-0/+2
When dreq is allocated by nfs_direct_req_alloc(), dreq->kref is initialized to 2. Therefore we need to call nfs_direct_req_release() twice to release the allocated dreq. Usually it is called in nfs_file_direct_{read, write}() and nfs_direct_complete(). However, current code only calls nfs_direct_req_relese() once if nfs_get_lock_context() fails in nfs_file_direct_{read, write}(). So, that case would result in memory leak. Fix this by adding the missing call. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-01-15NFS: Fix nfs_direct_write_reschedule_io()Trond Myklebust1-1/+2
The 'hdr->good_bytes' is defined as the number of bytes we expect to read or write starting at offset hdr->io_start. In the case of a partial read/write we may end up adjusting hdr->args.offset and hdr->args.count to skip I/O for data that was already read/written, and so we must ensure the calculation takes that into account. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15NFS/pnfs: Fix pnfs_generic_prepare_to_resend_writes()Trond Myklebust1-2/+2
Instead of making assumptions about the commit verifier contents, change the commit code to ensure we always check that the verifier was set by the XDR code. Fixes: f54bcf2ecee9 ("pnfs: Prepare for flexfiles by pulling out common code") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-09NFS: Remove redundant mirror tracking in O_DIRECTTrond Myklebust1-42/+0
We no longer need the extra mirror length tracking in the O_DIRECT code, as we are able to track the maximum contiguous length in dreq->max_count. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-09NFS: Fix O_DIRECT accounting of number of bytes read/writtenTrond Myklebust1-35/+43
When a series of O_DIRECT reads or writes are truncated, either due to eof or due to an error, then we should return the number of contiguous bytes that were received/sent starting at the offset specified by the application. Currently, we are failing to correctly check contiguity, and so we're failing the generic/465 in xfstests when the race between the read and write RPCs causes the file to get extended while the 2 reads are outstanding. If the first read RPC call wins the race and returns with eof set, we should treat the second read RPC as being truncated. Reported-by: Su Yanjun <suyj.fnst@cn.fujitsu.com> Fixes: 1ccbad9f9f9bd ("nfs: fix DIO good bytes calculation") Cc: stable@vger.kernel.org # 4.1+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-19NFS: Ensure O_DIRECT reports an error if the bytes read/written is 0Trond Myklebust1-9/+18
If the attempt to resend the I/O results in no bytes being read/written, we must ensure that we report the error. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Fixes: 0a00b77b331a ("nfs: mirroring support for direct io") Cc: stable@vger.kernel.org # v3.20+
2019-05-21treewide: Add SPDX license identifier for missed filesThomas Gleixner1-0/+1
Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-25pNFS: Add tracking to limit the number of pNFS retriesTrond Myklebust1-0/+7
When the client is reading or writing using pNFS, and hits an error on the DS, then it typically sends a LAYOUTERROR and/or LAYOUTRETURN to the MDS, before redirtying the failed pages, and going for a new round of reads/writebacks. The problem is that if the server has no way to fix the DS, then we may need a way to interrupt this loop after a set number of attempts have been made. This patch adds an optional module parameter that allows the admin to specify how many times to retry the read/writeback process before failing with a fatal error. The default behaviour is to retry forever. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25NFS: Remove unused argument from nfs_create_request()Trond Myklebust1-2/+2
All the callers of nfs_create_request() are now creating page group heads, so we can remove the redundant 'last' page argument. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-02-20NFS: Pass error information to the pgio error cleanup routineTrond Myklebust1-2/+2
Allow the caller to pass error information when cleaning up a failed I/O request so that we can conditionally take action to cancel the request altogether if the error turned out to be fatal. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20NFS: Clean up list moves of struct nfs_pageTrond Myklebust1-2/+1
In several places we're just moving the struct nfs_page from one list to another by first removing from the existing list, then adding to the new one. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-12-02nfs: don't dirty kernel pages read by direct-ioDave Kleikamp1-1/+8
When we use direct_IO with an NFS backing store, we can trigger a WARNING in __set_page_dirty(), as below, since we're dirtying the page unnecessarily in nfs_direct_read_completion(). To fix, replicate the logic in commit 53cbf3b157a0 ("fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read"). Other filesystems that implement direct_IO handle this; most use blockdev_direct_IO(). ceph and cifs have similar logic. mount 127.0.0.1:/export /nfs dd if=/dev/zero of=/nfs/image bs=1M count=200 losetup --direct-io=on -f /nfs/image mkfs.btrfs /dev/loop0 mount -t btrfs /dev/loop0 /mnt/ kernel: WARNING: CPU: 0 PID: 8067 at fs/buffer.c:580 __set_page_dirty+0xaf/0xd0 kernel: Modules linked in: loop(E) nfsv3(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) fscache(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) fuse(E) tun(E) ip6t_rpfilter(E) ipt_REJECT(E) nf_ kernel: snd_seq(E) snd_seq_device(E) snd_pcm(E) video(E) snd_timer(E) snd(E) soundcore(E) ip_tables(E) xfs(E) libcrc32c(E) sd_mod(E) sr_mod(E) cdrom(E) ata_generic(E) pata_acpi(E) crc32c_intel(E) ahci(E) li kernel: CPU: 0 PID: 8067 Comm: kworker/0:2 Tainted: G E 4.20.0-rc1.master.20181111.ol7.x86_64 #1 kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 kernel: Workqueue: nfsiod rpc_async_release [sunrpc] kernel: RIP: 0010:__set_page_dirty+0xaf/0xd0 kernel: Code: c3 48 8b 02 f6 c4 04 74 d4 48 89 df e8 ba 05 f7 ff 48 89 c6 eb cb 48 8b 43 08 a8 01 75 1f 48 89 d8 48 8b 00 a8 04 74 02 eb 87 <0f> 0b eb 83 48 83 e8 01 eb 9f 48 83 ea 01 0f 1f 00 eb 8b 48 83 e8 kernel: RSP: 0000:ffffc1c8825b7d78 EFLAGS: 00013046 kernel: RAX: 000fffffc0020089 RBX: fffff2b603308b80 RCX: 0000000000000001 kernel: RDX: 0000000000000001 RSI: ffff9d11478115c8 RDI: ffff9d11478115d0 kernel: RBP: ffffc1c8825b7da0 R08: 0000646f6973666e R09: 8080808080808080 kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff9d11478115d0 kernel: R13: ffff9d11478115c8 R14: 0000000000003246 R15: 0000000000000001 kernel: FS: 0000000000000000(0000) GS:ffff9d115ba00000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 00007f408686f640 CR3: 0000000104d8e004 CR4: 00000000000606f0 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 kernel: Call Trace: kernel: __set_page_dirty_buffers+0xb6/0x110 kernel: set_page_dirty+0x52/0xb0 kernel: nfs_direct_read_completion+0xc4/0x120 [nfs] kernel: nfs_pgio_release+0x10/0x20 [nfs] kernel: rpc_free_task+0x30/0x70 [sunrpc] kernel: rpc_async_release+0x12/0x20 [sunrpc] kernel: process_one_work+0x174/0x390 kernel: worker_thread+0x4f/0x3e0 kernel: kthread+0x102/0x140 kernel: ? drain_workqueue+0x130/0x130 kernel: ? kthread_stop+0x110/0x110 kernel: ret_from_fork+0x35/0x40 kernel: ---[ end trace 01341980905412c9 ]--- Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> [forward-ported to v4.20] Signed-off-by: Calum Mackay <calum.mackay@oracle.com> Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-08-08NFS: Use an appropriate work queue for direct-write completionNeilBrown1-1/+1
When a direct-write completes, a work_struct is schedule to handle the completion. When NFS is being used for swap, the direct write might be a swap-out, so memory allocation can block until the write completes. The work queue currently used is not WQ_MEM_RECLAIM, so tasks can block waiting for memory - this leads to deadlock. So use nfsiod_workqueue instead. This will always have a running thread, and work items should never block waiting for memory. Signed-off-by: Neil Brown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-03-08NFS: Fix an incorrect type in struct nfs_direct_reqTrond Myklebust1-1/+1
The start offset needs to be of type loff_t. Fixed: 5fadeb47dcc5c ("nfs: count DIO good bytes correctly with mirroring") Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-16NFS: commit direct writes even if they fail partiallyJ. Bruce Fields1-3/+1
If some of the WRITE calls making up an O_DIRECT write syscall fail, we neglect to commit, even if some of the WRITEs succeed. We also depend on the commit code to free the reference count on the nfs_page taken in the "if (request_commit)" case at the end of nfs_direct_write_completion(). The problem was originally noticed because ENOSPC's encountered partway through a write would result in a closed file being sillyrenamed when it should have been unlinked. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFSv4: Use a mutex to protect the per-inode commit listsTrond Myklebust1-2/+2
The commit lists can get very large, so using the inode->i_lock can end up affecting general metadata performance. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-10Merge tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-19/+2
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable bugfixes: - Fix use after free in write error path - Use GFP_NOIO for two allocations in writeback - Fix a hang in OPEN related to server reboot - Check the result of nfs4_pnfs_ds_connect - Fix an rcu lock leak Features: - Removal of the unmaintained and unused OSD pNFS layout - Cleanup and removal of lots of unnecessary dprintk()s - Cleanup and removal of some memory failure paths now that GFP_NOFS is guaranteed to never fail. - Remove the v3-only data server limitation on pNFS/flexfiles Bugfixes: - RPC/RDMA connection handling bugfixes - Copy offload: fixes to ensure the copied data is COMMITed to disk. - Readdir: switch back to using the ->iterate VFS interface - File locking fixes from Ben Coddington - Various use-after-free and deadlock issues in pNFS - Write path bugfixes" * tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (89 commits) pNFS/flexfiles: Always attempt to call layoutstats when flexfiles is enabled NFSv4.1: Work around a Linux server bug... NFS append COMMIT after synchronous COPY NFSv4: Fix exclusive create attributes encoding NFSv4: Fix an rcu lock leak nfs: use kmap/kunmap directly NFS: always treat the invocation of nfs_getattr as cache hit when noac is on Fix nfs_client refcounting if kmalloc fails in nfs4_proc_exchange_id and nfs4_proc_async_renew NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION pNFS: Fix NULL dereference in pnfs_generic_alloc_ds_commits pNFS: Fix a typo in pnfs_generic_alloc_ds_commits pNFS: Fix a deadlock when coalescing writes and returning the layout pNFS: Don't clear the layout return info if there are segments to return pNFS: Ensure we commit the layout if it has been invalidated pNFS: Don't send COMMITs to the DSes if the server invalidated our layout pNFS/flexfiles: Fix up the ff_layout_write_pagelist failure path pNFS: Ensure we check layout validity before marking it for return NFS4.1 handle interrupted slot reuse from ERR_DELAY NFSv4: check return value of xdr_inline_decode nfs/filelayout: fix NULL pointer dereference in fl_pnfs_update_layout() ...
2017-04-20NFS: Clean up nfs_direct_commit_complete()Anna Schumaker1-8/+1
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20NFS: Remove nfs_direct_readpage_release()Anna Schumaker1-11/+1
Just remove the function and have the caller use nfs_release_request() instead. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-17fix nfs O_DIRECT advancing iov_iter too muchAl Viro1-9/+18
It leaves the iterator advanced by the amount of IO it has requested instead of the amount actually transferred. Among other things, that confuses the hell out of generic_file_splice_read(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-24Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds1-1/+1
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-01NFS: Remove unused argument from nfs_direct_write_complete()Anna Schumaker1-6/+6
This parameter hasn't been used since 2a009ec9 (Linux 3.13-rc3), so let's remove it from this function. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-09-23NFS: direct: use complete() instead of complete_all()Daniel Wagner1-1/+1
There is only one waiter for the completion, therefore there is no need to use complete_all(). Let's make that clear by using complete() instead of complete_all(). nfs_file_direct_write() or nfs_file_direct_read() allocated a request object via nfs_direct_req_alloc(), which initializes the completion. The request object then is freed later in the exit path. Between the initialization and the release either nfs_direct_write_schedule_iovec() resp nfs_direct_read_schedule_iovec() are called which will asynchronously process the request. The calling function waits via nfs_direct_wait() till the async work has been done. Thus there is only one waiter on the completion. nfs_direct_pgio_init() and nfs_direct_read_completion() are passed via function pointers to nfs pageio. The first function does a ref counting (get_dreq() and put_dreq()) which ensures that nfs_direct_read_completion() and nfs_direct_read_schedule_iovec() only call the completion path once. The usage pattern of the completion is: waiter context waker context nfs_file_direct_write() dreq = nfs_direct_req_alloc() init_completion() nfs_direct_write_schedule_iovec() nfs_direct_wait() wait_for_completion_killable() nfs_direct_write_schedule_work() nfs_direct_complete() complete() nfs_file_direct_read() dreq = nfs_direct_req_all() init_completion() nfs_direct_read_schedule_iovec() nfs_direct_wait() wait_for_completion_killable() nfs_direct_read_schedule_iovec() nfs_direct_complete() complete() nfs_direct_read_completion() nfs_direct_complete() complete() Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-30Merge tag 'nfs-for-4.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-61/+32
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable bugfixes: - nfs: don't create zero-length requests - several LAYOUTGET bugfixes Features: - several performance related features - more aggressive caching when we can rely on close-to-open cache consistency - remove serialisation of O_DIRECT reads and writes - optimise several code paths to not flush to disk unnecessarily. However allow for the idiosyncracies of pNFS for those layout types that need to issue a LAYOUTCOMMIT before the metadata can be updated on the server. - SUNRPC updates to the client data receive path - pNFS/SCSI support RH/Fedora dm-mpath device nodes - pNFS files/flexfiles can now use unprivileged ports when the generic NFS mount options allow it. Bugfixes: - Don't use RDMA direct data placement together with data integrity or privacy security flavours - Remove the RDMA ALLPHYSICAL memory registration mode as it has potential security holes. - Several layout recall fixes to improve NFSv4.1 protocol compliance. - Fix an Oops in the pNFS files and flexfiles connection setup to the DS - Allow retry of operations that used a returned delegation stateid - Don't mark the inode as revalidated if a LAYOUTCOMMIT is outstanding - Fix writeback races in nfs4_copy_range() and nfs42_proc_deallocate()" * tag 'nfs-for-4.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (104 commits) pNFS: Actively set attributes as invalid if LAYOUTCOMMIT is outstanding NFSv4: Clean up lookup of SECINFO_NO_NAME NFSv4.2: Fix warning "variable ‘stateids’ set but not used" NFSv4: Fix warning "no previous prototype for ‘nfs4_listxattr’" SUNRPC: Fix a compiler warning in fs/nfs/clnt.c pNFS: Remove redundant smp_mb() from pnfs_init_lseg() pNFS: Cleanup - do layout segment initialisation in one place pNFS: Remove redundant stateid invalidation pNFS: Remove redundant pnfs_mark_layout_returned_if_empty() pNFS: Clear the layout metadata if the server changed the layout stateid pNFS: Cleanup - don't open code pnfs_mark_layout_stateid_invalid() NFS: pnfs_mark_matching_lsegs_return() should match the layout sequence id pNFS: Do not set plh_return_seq for non-callback related layoutreturns pNFS: Ensure layoutreturn acts as a completion for layout callbacks pNFS: Fix CB_LAYOUTRECALL stateid verification pNFS: Always update the layout barrier seqid on LAYOUTGET pNFS: Always update the layout stateid if NFS_LAYOUT_INVALID_STID is set pNFS: Clear the layout return tracking on layout reinitialisation pNFS: LAYOUTRETURN should only update the stateid if the layout is valid nfs: don't create zero-length requests ...
2016-07-28Merge branch 'work.misc' of ↵Linus Torvalds1-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Assorted cleanups and fixes. Probably the most interesting part long-term is ->d_init() - that will have a bunch of followups in (at least) ceph and lustre, but we'll need to sort the barrier-related rules before it can get used for really non-trivial stuff. Another fun thing is the merge of ->d_iput() callers (dentry_iput() and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all except the one in __d_lookup_lru())" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) fs/dcache.c: avoid soft-lockup in dput() vfs: new d_init method vfs: Update lookup_dcache() comment bdev: get rid of ->bd_inodes Remove last traces of ->sync_page new helper: d_same_name() dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends() vfs: clean up documentation vfs: document ->d_real() vfs: merge .d_select_inode() into .d_real() unify dentry_iput() and dentry_unlink_inode() binfmt_misc: ->s_root is not going anywhere drop redundant ->owner initializations ufs: get rid of redundant checks orangefs: constify inode_operations missed comment updates from ->direct_IO() prototype change file_inode(f)->i_mapping is f->f_mapping trim fsnotify hooks a bit 9p: new helper - v9fs_parent_fid() debugfs: ->d_parent is never NULL or negative ...
2016-07-24Merge branch 'writeback'Trond Myklebust1-61/+32
2016-07-05NFS: Cleanup nfs_direct_complete()Trond Myklebust1-7/+5
There is only one caller that sets the "write" argument to true, so just move the call to nfs_zap_mapping() and get rid of the now redundant argument. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-05NFS: Do not serialise O_DIRECT reads and writesTrond Myklebust1-32/+9
Allow dio requests to be scheduled in parallel, but ensuring that they do not conflict with buffered I/O. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-05NFS Cleanup: move call to generic_write_checks() into fs/nfs/direct.cTrond Myklebust1-4/+8
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-05NFS: Remove racy size manipulations in O_DIRECTTrond Myklebust1-16/+0
On success, the RPC callbacks will ensure that we make the appropriate calls to nfs_writeback_update_inode() Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-05NFS: Ensure we reset the write verifier 'committed' value on resend.Trond Myklebust1-0/+2
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-05NFS: Fix O_DIRECT verifier problemsTrond Myklebust1-2/+8
We should not be interested in looking at the value of the stable field, since that could take any value. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-06-24NFS: Fix up O_DIRECT resultsTrond Myklebust1-3/+7
if we read or wrote something, we must report it Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Jeff Layton <jlayton@poochiereds.net> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-29missed comment updates from ->direct_IO() prototype changeAl Viro1-3/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-26Merge tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds1-7/+10
Pull NFS client updates from Anna Schumaker: "Highlights include: Features: - Add support for the NFS v4.2 COPY operation - Add support for NFS/RDMA over IPv6 Bugfixes and cleanups: - Avoid race that crashes nfs_init_commit() - Fix oops in callback path - Fix LOCK/OPEN race when unlinking an open file - Choose correct stateids when using delegations in setattr, read and write - Don't send empty SETATTR after OPEN_CREATE - xprtrdma: Prevent server from writing a reply into memory client has released - xprtrdma: Support using Read list and Reply chunk in one RPC call" * tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (61 commits) pnfs: pnfs_update_layout needs to consider if strict iomode checking is on nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled nfs/flexfiles: Helper function to detect FF_FLAGS_NO_READ_IO nfs: avoid race that crashes nfs_init_commit NFS: checking for NULL instead of IS_ERR() in nfs_commit_file() pnfs: make pnfs_layout_process more robust pnfs: rework LAYOUTGET retry handling pnfs: lift retry logic from send_layoutget to pnfs_update_layout pnfs: fix bad error handling in send_layoutget flexfiles: add kerneldoc header to nfs4_ff_layout_prepare_ds flexfiles: remove pointless setting of NFS_LAYOUT_RETURN_REQUESTED pnfs: only tear down lsegs that precede seqid in LAYOUTRETURN args pnfs: keep track of the return sequence number in pnfs_layout_hdr pnfs: record sequence in pnfs_layout_segment when it's created pnfs: don't merge new ff lsegs with ones that have LAYOUTRETURN bit set pNFS/flexfiles: When initing reads or writes, we might have to retry connecting to DSes pNFS/flexfiles: When checking for available DSes, conditionally check for MDS io pNFS/flexfile: Fix erroneous fall back to read/write through the MDS NFS: Reclaim writes via writepage are opportunistic NFSv4: Use the right stateid for delegations in setattr, read and write ...
2016-05-17Merge branch 'work.preadv2' of ↵Linus Torvalds1-11/+10
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs cleanups from Al Viro: "More cleanups from Christoph" * 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: nfsd: use RWF_SYNC fs: add RWF_DSYNC aand RWF_SYNC ceph: use generic_write_sync fs: simplify the generic_write_sync prototype fs: add IOCB_SYNC and IOCB_DSYNC direct-io: remove the offset argument to dio_complete direct-io: eliminate the offset argument to ->direct_IO xfs: eliminate the pos variable in xfs_file_dio_aio_write filemap: remove the pos argument to generic_file_direct_write filemap: remove pos variables in generic_file_read_iter
2016-05-09NFS: Save struct inode * inside nfs_commit_info to clarify usage of i_lockDave Wysochanski1-5/+5
Commit ea2cf22 created nfs_commit_info and saved &inode->i_lock inside this NFS specific structure. This obscures the usage of i_lock. Instead, save struct inode * so later it's clear the spinlock taken is i_lock. Should be no functional change. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: add debug to directio "good_bytes" countingWeston Andros Adamson1-2/+5
This will pop a warning if we count too many good bytes. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-02Merge getxattr prototype change into work.lookupsAl Viro1-1/+1
The rest of work.xattr stuff isn't needed for this branch
2016-05-01fs: simplify the generic_write_sync prototypeChristoph Hellwig1-1/+3
The kiocb already has the new position, so use that. The only interesting case is AIO, where we currently don't bother updating ki_pos. We're about to free the kiocb after we're done, so we might as well update it to make everyone's life simpler. While we're at it also return the bytes written argument passed in if we were successful so that the boilerplate error switch code in the callers can go away. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01fs: add IOCB_SYNC and IOCB_DSYNCChristoph Hellwig1-1/+1
This will allow us to do per-I/O sync file writes, as required by a lot of fileservers or storage targets. XXX: Will need a few additional audits for O_DSYNC Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01direct-io: eliminate the offset argument to ->direct_IOChristoph Hellwig1-10/+7
Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to actually work, so eliminate the superflous argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-10don't bother with ->d_inode->i_sb - it's always equal to ->d_sbAl Viro1-1/+1
... and neither can ever be NULL Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov1-4/+4
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-22wrappers for ->i_mutex accessAl Viro1-6/+6
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-04Merge branch 'pnfs_generic'Trond Myklebust1-9/+24
* pnfs_generic: NFSv4.1/pNFS: Cleanup constify struct pnfs_layout_range arguments NFSv4.1/pnfs: Cleanup copying of pnfs_layout_range structures NFSv4.1/pNFS: Cleanup pnfs_mark_matching_lsegs_invalid() NFSv4.1/pNFS: Fix a race in initiate_file_draining() NFSv4.1/pNFS: pnfs_error_mark_layout_for_return() must always return layout NFSv4.1/pNFS: pnfs_mark_matching_lsegs_return() should set the iomode NFSv4.1/pNFS: Use nfs4_stateid_copy for copying stateids NFSv4.1/pNFS: Don't pass stateids by value to pnfs_send_layoutreturn() NFS: Relax requirements in nfs_flush_incompatible NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalid NFS: Allow multiple commit requests in flight per file NFS/pNFS: Fix up pNFS write reschedule layering violations and bugs NFSv4: List stateid information in the callback tracepoints NFSv4.1/pNFS: Don't return NFS4ERR_DELAY unnecessarily in CB_LAYOUTRECALL NFSv4.1/pNFS: Ensure we enforce RFC5661 Section 12.5.5.2.1 pNFS: If we have to delay the layout callback, mark the layout for return NFSv4.1/pNFS: Add a helper to mark the layout as returned pNFS: Ensure nfs4_layoutget_prepare returns the correct error
2015-12-31NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalidTrond Myklebust1-0/+12
If the layout segment is invalid, then we should not be adding more write requests to the commit list. Instead, those writes should be replayed after requesting a new layout. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-31NFS: Allow multiple commit requests in flight per fileTrond Myklebust1-6/+0
Allow synchronous RPC calls to wait for pending RPC calls to finish, but also allow asynchronous ones to just fire off another commit. With this patch, the xfstests generic/074 test completes in 226s instead of 242s Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-31NFS/pNFS: Fix up pNFS write reschedule layering violations and bugsTrond Myklebust1-6/+15
The flexfiles layout in particular, seems to want to poke around in the O_DIRECT flags when retransmitting. This patch sets up an interface to allow it to call back into O_DIRECT to handle retransmission correctly. It also fixes a potential bug whereby we could change the behaviour of O_DIRECT if an error is already pending. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-28NFS41: pop some layoutget errors to applicationPeng Tao1-1/+14
For ERESTARTSYS/EIO/EROFS/ENOSPC/E2BIG in layoutget, we should just bail out instead of hiding the error and retrying inband IO. Change all the call sites to pop the error all the way up. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-22NFS: Skip checking ds_cinfo.buckets when lseg's commit_through_mds is setKinglong Mee1-2/+5
When lseg's commit_through_mds is set, pnfs client always WARN once in nfs_direct_select_verf after checking ds_cinfo.nbuckets. nfs should use the DS verf except commit_through_mds is set for layout segment where nbuckets is zero. [17844.666094] ------------[ cut here ]------------ [17844.667071] WARNING: CPU: 0 PID: 21758 at /root/source/linux-pnfs/fs/nfs/direct.c:174 nfs_direct_select_verf+0x5a/0x70 [nfs]() [17844.668650] Modules linked in: nfs_layout_nfsv41_files(OE) nfsv4(OE) nfs(OE) fscache(E) nfsd(OE) xfs libcrc32c btrfs ppdev coretemp crct10dif_pclmul auth_rpcgss crc32_pclmul crc32c_intel nfs_acl ghash_clmulni_intel lockd vmw_balloon xor vmw_vmci grace raid6_pq shpchp sunrpc parport_pc i2c_piix4 parport vmwgfx drm_kms_helper ttm drm serio_raw mptspi e1000 scsi_transport_spi mptscsih mptbase ata_generic pata_acpi [last unloaded: fscache] [17844.686676] CPU: 0 PID: 21758 Comm: kworker/0:1 Tainted: G W OE 4.3.0-rc1-pnfs+ #245 [17844.687352] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014 [17844.698502] Workqueue: nfsiod rpc_async_release [sunrpc] [17844.699212] 0000000000000009 0000000043e58010 ffff8800454fbc10 ffffffff813680c4 [17844.699990] ffff8800454fbc48 ffffffff8108b49d ffff88004eb20000 ffff88004eb20000 [17844.700844] ffff880062e26000 0000000000000000 0000000000000001 ffff8800454fbc58 [17844.701637] Call Trace: [17844.725252] [<ffffffff813680c4>] dump_stack+0x19/0x25 [17844.732693] [<ffffffff8108b49d>] warn_slowpath_common+0x7d/0xb0 [17844.733855] [<ffffffff8108b5da>] warn_slowpath_null+0x1a/0x20 [17844.735015] [<ffffffffa04a27ca>] nfs_direct_select_verf+0x5a/0x70 [nfs] [17844.735999] [<ffffffffa04a2b83>] nfs_direct_set_hdr_verf+0x23/0x90 [nfs] [17844.736846] [<ffffffffa04a2e17>] nfs_direct_write_completion+0x227/0x260 [nfs] [17844.737782] [<ffffffffa04a433c>] nfs_pgio_release+0x1c/0x20 [nfs] [17844.738597] [<ffffffffa0502df3>] pnfs_generic_rw_release+0x23/0x30 [nfsv4] [17844.739486] [<ffffffffa01cbbea>] rpc_free_task+0x2a/0x70 [sunrpc] [17844.740326] [<ffffffffa01cbcd5>] rpc_async_release+0x15/0x20 [sunrpc] [17844.741173] [<ffffffff810a387c>] process_one_work+0x21c/0x4c0 [17844.741984] [<ffffffff810a37cd>] ? process_one_work+0x16d/0x4c0 [17844.742837] [<ffffffff810a3b6a>] worker_thread+0x4a/0x440 [17844.743639] [<ffffffff810a3b20>] ? process_one_work+0x4c0/0x4c0 [17844.744399] [<ffffffff810a3b20>] ? process_one_work+0x4c0/0x4c0 [17844.745176] [<ffffffff810a8d75>] kthread+0xf5/0x110 [17844.745927] [<ffffffff810a8c80>] ? kthread_create_on_node+0x240/0x240 [17844.747105] [<ffffffff8172ce1f>] ret_from_fork+0x3f/0x70 [17844.747856] [<ffffffff810a8c80>] ? kthread_create_on_node+0x240/0x240 [17844.748642] ---[ end trace 336a2845d42b83f0 ]--- Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-26Merge tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-21/+18
Pull NFS client updates from Trond Myklebust: "Another set of mainly bugfixes and a couple of cleanups. No new functionality in this round. Highlights include: Stable patches: - Fix a regression in /proc/self/mountstats - Fix the pNFS flexfiles O_DIRECT support - Fix high load average due to callback thread sleeping Bugfixes: - Various patches to fix the pNFS layoutcommit support - Do not cache pNFS deviceids unless server notifications are enabled - Fix a SUNRPC transport reconnection regression - make debugfs file creation failure non-fatal in SUNRPC - Another fix for circular directory warnings on NFSv4 "junctioned" mountpoints - Fix locking around NFSv4.2 fallocate() support - Truncating NFSv4 file opens should also sync O_DIRECT writes - Prevent infinite loop in rpcrdma_ep_create() Features: - Various improvements to the RDMA transport code's handling of memory registration - Various code cleanups" * tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (55 commits) fs/nfs: fix new compiler warning about boolean in switch nfs: Remove unneeded casts in nfs NFS: Don't attempt to decode missing directory entries Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one" NFS: Rename idmap.c to nfs4idmap.c NFS: Move nfs_idmap.h into fs/nfs/ NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h NFS: Add a stub for GETDEVICELIST nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes nfs: fix DIO good bytes calculation nfs: Fetch MOUNTED_ON_FILEID when updating an inode sunrpc: make debugfs file creation failure non-fatal nfs: fix high load average due to callback thread sleeping NFS: Reduce time spent holding the i_mutex during fallocate() NFS: Don't zap caches on fallocate() xprtrdma: Make rpcrdma_{un}map_one() into inline functions xprtrdma: Handle non-SEND completions via a callout xprtrdma: Add "open" memreg op xprtrdma: Add "destroy MRs" memreg op xprtrdma: Add "reset MRs" memreg op ...
2015-04-24direct-io: only inc/dec inode->i_dio_count for file systemsJens Axboe1-5/+5
do_blockdev_direct_IO() increments and decrements the inode ->i_dio_count for each IO operation. It does this to protect against truncate of a file. Block devices don't need this sort of protection. For a capable multiqueue setup, this atomic int is the only shared state between applications accessing the device for O_DIRECT, and it presents a scaling wall for that. In my testing, as much as 30% of system time is spent incrementing and decrementing this value. A mixed read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with better latencies too. Before: clat percentiles (usec): | 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34], | 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35], | 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80], | 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155], | 99.99th=[ 165] After: clat percentiles (usec): | 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149], | 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171], | 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270], | 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422], | 99.99th=[ 438] In other setups, Robert Elliott reported seeing good performance improvements: https://lkml.org/lkml/2015/4/3/557 The more applications accessing the device, the worse it gets. Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells do_blockdev_direct_IO() that it need not worry about incrementing or decrementing the inode i_dio_count for this caller. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Elliott, Robert (Server Storage) <elliott@hp.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-23nfs: remove WARN_ON_ONCE from nfs_direct_good_bytesPeng Tao1-2/+0
For flexfiles driver, we might choose to read from mirror index other than 0 while mirror_count is always 1 for read. Reported-by: Jean Spector <jean@primarydata.com> Cc: <stable@vger.kernel.org> # v3.19+ Cc: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23nfs: fix DIO good bytes calculationPeng Tao1-12/+17
For direct read that has IO size larger than rsize, we'll split it into several READ requests and nfs_direct_good_bytes() would count completed bytes incorrectly by eating last zero count reply. Fix it by handling mirror and non-mirror cases differently such that we only count mirrored writes differently. This fixes 5fadeb47("nfs: count DIO good bytes correctly with mirroring"). Reported-by: Jean Spector <jean@primarydata.com> Cc: <stable@vger.kernel.org> # v3.19+ Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-15VFS: normal filesystems (and lustre): d_inode() annotationsDavid Howells1-2/+2
that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15nfs: generic_write_checks() shouldn't be done on swapout...Al Viro1-9/+3
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11switch generic_write_checks() to iocb and iterAl Viro1-15/+10
... returning -E... upon error and amount of data left in iter after (possible) truncation upon success. Note, that normal case gives a non-zero (positive) return value, so any tests for != 0 _must_ be updated. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Conflicts: fs/ext4/file.c
2015-04-11generic_write_checks(): drop isblk argumentAl Viro1-1/+1
all remaining callers are passing 0; some just obscure that fact. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11direct_IO: remove rw from a_ops->direct_IO()Omar Sandoval1-2/+1
Now that no one is using rw, remove it completely. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11direct_IO: use iov_iter_rw() instead of rw everywhereOmar Sandoval1-1/+1
The rw parameter to direct_IO is redundant with iov_iter->type, and treated slightly differently just about everywhere it's used: some users do rw & WRITE, and others do rw == WRITE where they should be doing a bitwise check. Simplify this with the new iov_iter_rw() helper, which always returns either READ or WRITE. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11Merge branch 'iocb' into for-nextAl Viro1-2/+2
2015-03-27NFSv4.1/pnfs: Ensure that writes respect the O_SYNC flag when doing O_DIRECTTrond Myklebust1-0/+1
If the caller does not specify the O_SYNC flag, then it is legitimate to return from O_DIRECT without doing a pNFS layoutcommit operation. However if the file is opened O_DIRECT|O_SYNC then we'd better get it right. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-13fs: split generic and aio kiocbChristoph Hellwig1-1/+1
Most callers in the kernel want to perform synchronous file I/O, but still have to bloat the stack with a full struct kiocb. Split out the parts needed in filesystem code from those in the aio code, and only allocate those needed to pass down argument on the stack. The aio code embedds the generic iocb in the one it allocates and can easily get back to it by using container_of. Also add a ->ki_complete method to struct kiocb, this is used to call into the aio code and thus removes the dependency on aio for filesystems impementing asynchronous operations. It will also allow other callers to substitute their own completion callback. We also add a new ->ki_flags field to work around the nasty layering violation recently introduced in commit 5e33f6 ("usb: gadget: ffs: add eventfd notification about ffs events"). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-13nfs: clean up nfs_direct_IOPeng Tao1-7/+0
This follows up "nfs: fix dio deadlock when O_DIRECT flag is flipped" and removes the unnecessary CONFIG_NFS_SWAP switch. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-12fs: remove ki_nbytesChristoph Hellwig1-1/+1
There is no need to pass the total request length in the kiocb, as we already get passed in through the iov_iter argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-13NFS: struct nfs_commit_info.lock must always point to inode->i_lockTrond Myklebust1-1/+1
Commit 411a99adffb4f (nfs: clear_request_commit while holding i_lock) assumes that the nfs_commit_info always points to the inode->i_lock. For historical reasons, that is not the case for O_DIRECT writes. Cc: Weston Andros Adamson <dros@primarydata.com> Fixes: 411a99adffb4f ("nfs: clear_request_commit while holding i_lock") Cc: stable@vger.kernel.org # 3.17.x Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-03Merge branch 'flexfiles'Trond Myklebust1-17/+95
* flexfiles: (53 commits) pnfs: lookup new lseg at lseg boundary nfs41: .init_read and .init_write can be called with valid pg_lseg pnfs: Update documentation on the Layout Drivers pnfs/flexfiles: Add the FlexFile Layout Driver nfs: count DIO good bytes correctly with mirroring nfs41: wait for LAYOUTRETURN before retrying LAYOUTGET nfs: add a helper to set NFS_ODIRECT_RESCHED_WRITES to direct writes nfs41: add NFS_LAYOUT_RETRY_LAYOUTGET to layout header flags nfs/flexfiles: send layoutreturn before freeing lseg nfs41: introduce NFS_LAYOUT_RETURN_BEFORE_CLOSE nfs41: allow async version layoutreturn nfs41: add range to layoutreturn args pnfs: allow LD to ask to resend read through pnfs nfs: add nfs_pgio_current_mirror helper nfs: only reset desc->pg_mirror_idx when mirroring is supported nfs41: add a debug warning if we destroy an unempty layout pnfs: fail comparison when bucket verifier not set nfs: mirroring support for direct io nfs: add mirroring support to pgio layer pnfs: pass ds_commit_idx through the commit path ... Conflicts: fs/nfs/pnfs.c fs/nfs/pnfs.h
2015-02-03nfs: count DIO good bytes correctly with mirroringPeng Tao1-4/+8
When resending to MDS, we might resend multiple mirroring requests to MDS. As a result, nfs_direct_good_bytes() ends up counting bytes multiple times, causing application to get wrong return results in read/write syscalls. Fix it by tracking start of a dreq and checking the range of pgio header. Cc: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com>
2015-02-03nfs: add a helper to set NFS_ODIRECT_RESCHED_WRITES to direct writesPeng Tao1-0/+6
To allow pnfs LD to ask direct writes to be resend. Signed-off-by: Peng Tao <tao.peng@primarydata.com>
2015-02-03pnfs: fail comparison when bucket verifier not setWeston Andros Adamson1-1/+5
This skips the WARN_ON_ONCE, but doesnt change behavior (the memcmp would fail). Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03nfs: mirroring support for direct ioWeston Andros Adamson1-14/+57
The current mirroring code only notices short writes to the first mirror. This patch keeps per-mirror byte counts and only considers a byte to be written once all mirrors report so. Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
2015-02-03nfs: add mirroring support to pgio layerWeston Andros Adamson1-3/+14
This patch adds mirrored write support to the pgio layer. The default is to use one mirror, but pgio callers may define callbacks to change this to any value up to the (arbitrarily selected) limit of 16. The basic idea is to break out members of nfs_pageio_descriptor that cannot be shared between mirrored DSes and put them in a new structure. Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
2015-02-03pnfs: pass ds_commit_idx through the commit pathWeston Andros Adamson1-2/+3
Pass ds_commit_idx through the nfs commit path. It's used to select the commit bucket when using pnfs and is ignored when not using pnfs. Several functions had to be changed: nfs_retry_commit, nfs_mark_request_commit, pnfs_mark_request_commit and the pnfs layout driver .mark_request_commit functions. Signed-off-by: Tom Haynes <loghyr@primarydata.com>
2015-02-03nfs: rename pgio header ds_idx to ds_commit_idxWeston Andros Adamson1-8/+6
'ds_commit_idx' is a better name - it is used to select the right commit bucket for pnfs. Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
2015-02-03pnfs: Do not grab the commit_info lock twice when rescheduling writesTom Haynes1-4/+15
Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Tom Haynes <loghyr@primarydata.com>
2015-01-21nfs: fix dio deadlock when O_DIRECT flag is flippedPeng Tao1-0/+6
We only support swap file calling nfs_direct_IO. However, application might be able to get to nfs_direct_IO if it toggles O_DIRECT flag during IO and it can deadlock because we grab inode->i_mutex in nfs_file_direct_write(). So return 0 for such case. Then the generic layer will fall back to buffer IO. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-12nfs: fix pnfs direct write memory leakPeng Tao1-0/+1
For pNFS direct writes, layout driver may dynamically allocate ds_cinfo.buckets. So we need to take care to free them when freeing dreq. Ideally this needs to be done inside layout driver where ds_cinfo.buckets are allocated. But buckets are attached to dreq and reused across LD IO iterations. So I feel it's OK to free them in the generic layer. Cc: stable@vger.kernel.org [v3.4+] Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-10-18Merge branch 'for-3.18/core' of git://git.kernel.dk/linux-blockLinus Torvalds1-7/+5
Pull core block layer changes from Jens Axboe: "This is the core block IO pull request for 3.18. Apart from the new and improved flush machinery for blk-mq, this is all mostly bug fixes and cleanups. - blk-mq timeout updates and fixes from Christoph. - Removal of REQ_END, also from Christoph. We pass it through the ->queue_rq() hook for blk-mq instead, freeing up one of the request bits. The space was overly tight on 32-bit, so Martin also killed REQ_KERNEL since it's no longer used. - blk integrity updates and fixes from Martin and Gu Zheng. - Update to the flush machinery for blk-mq from Ming Lei. Now we have a per hardware context flush request, which both cleans up the code should scale better for flush intensive workloads on blk-mq. - Improve the error printing, from Rob Elliott. - Backing device improvements and cleanups from Tejun. - Fixup of a misplaced rq_complete() tracepoint from Hannes. - Make blk_get_request() return error pointers, fixing up issues where we NULL deref when a device goes bad or missing. From Joe Lawrence. - Prep work for drastically reducing the memory consumption of dm devices from Junichi Nomura. This allows creating clone bio sets without preallocating a lot of memory. - Fix a blk-mq hang on certain combinations of queue depths and hardware queues from me. - Limit memory consumption for blk-mq devices for crash dump scenarios and drivers that use crazy high depths (certain SCSI shared tag setups). We now just use a single queue and limited depth for that" * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits) block: Remove REQ_KERNEL blk-mq: allocate cpumask on the home node bio-integrity: remove the needless fail handle of bip_slab creating block: include func name in __get_request prints block: make blk_update_request print prefix match ratelimited prefix blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio block: fix alignment_offset math that assumes io_min is a power-of-2 blk-mq: Make bt_clear_tag() easier to read blk-mq: fix potential hang if rolling wakeup depth is too high block: add bioset_create_nobvec() block: use bio_clone_fast() in blk_rq_prep_clone() block: misplaced rq_complete tracepoint sd: Honor block layer integrity handling flags block: Replace strnicmp with strncasecmp block: Add T10 Protection Information functions block: Don't merge requests if integrity flags differ block: Integrity checksum flag block: Relocate bio integrity flags block: Add a disk flag to block integrity profile block: Add prefix to block integrity profile flags ...
2014-10-14block: Remove REQ_KERNELMartin K. Petersen1-7/+5
REQ_KERNEL is no longer used. Remove it and drop the redundant uio argument to nfs_file_direct_{read,write}. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-12NFS: Unconditionally enable commit codeAnna Schumaker1-14/+0
The goal is to create a generic NFS module with code that does not depend on what versions of NFS are enabled. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-13Merge branch 'bugfixes' into linux-nextTrond Myklebust1-2/+0
* bugfixes: NFS: Don't reset pg_moreio in __nfs_pageio_add_request NFS: Remove 2 unused variables nfs: handle multiple reqs in nfs_wb_page_cancel nfs: handle multiple reqs in nfs_page_async_flush nfs: change find_request to find_head_request nfs: nfs_page should take a ref on the head req nfs: mark nfs_page reqs with flag for extra ref nfs: only show Posix ACLs in listxattr if actually present Conflicts: fs/nfs/write.c
2014-07-12NFS: Remove 2 unused variablesTrond Myklebust1-2/+0
Cc: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24nfs: remove unused writeverf codeWeston Andros Adamson1-17/+8
Remove duplicate writeverf structure from merge of nfs_pgio_header and nfs_pgio_data and remove writeverf related flags and logic to handle more than one RPC per nfs_pgio_header. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24nfs: merge nfs_pgio_data into _headerWeston Andros Adamson1-4/+4
struct nfs_pgio_data only exists as a member of nfs_pgio_header, but is passed around everywhere, because there used to be multiple _data structs per _header. Many of these functions then use the _data to find a pointer to the _header. This patch cleans this up by merging the nfs_pgio_data structure into nfs_pgio_header and passing nfs_pgio_header around instead. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24nfs: move nfs_pgio_data and remove nfs_rw_headerWeston Andros Adamson1-4/+4
nfs_rw_header was used to allocate an nfs_pgio_header along with an nfs_pgio_data, because a _header would need at least one _data. Now there is only ever one nfs_pgio_data for each nfs_pgio_header -- move it to nfs_pgio_header and get rid of nfs_rw_header. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-12Merge branch 'for-linus' of ↵Linus Torvalds1-224/+102
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "This the bunch that sat in -next + lock_parent() fix. This is the minimal set; there's more pending stuff. In particular, I really hope to get acct.c fixes merged this cycle - we need that to deal sanely with delayed-mntput stuff. In the next pile, hopefully - that series is fairly short and localized (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more iov_iter work. Most of prereqs for ->splice_write with sane locking order are there and Kent's dio rewrite would also fit nicely on top of this pile" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits) lock_parent: don't step on stale ->d_parent of all-but-freed one kill generic_file_splice_write() ceph: switch to iter_file_splice_write() shmem: switch to iter_file_splice_write() nfs: switch to iter_splice_write_file() fs/splice.c: remove unneeded exports ocfs2: switch to iter_file_splice_write() ->splice_write() via ->write_iter() bio_vec-backed iov_iter optimize copy_page_{to,from}_iter() bury generic_file_aio_{read,write} lustre: get rid of messing with iovecs ceph: switch to ->write_iter() ceph_sync_direct_write: stop poking into iov_iter guts ceph_sync_read: stop poking into iov_iter guts new helper: copy_page_from_iter() fuse: switch to ->write_iter() btrfs: switch to ->write_iter() ocfs2: switch to ->write_iter() xfs: switch to ->write_iter() ...
2014-05-29pnfs: support multiple verfs per direct reqWeston Andros Adamson1-5/+97
Support direct requests that span multiple pnfs data servers by comparing nfs_pgio_header->verf to a cached verf in pnfs_commit_bucket. Continue to use dreq->verf if the MDS is used / non-pNFS. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-29nfs: add support for multiple nfs reqs per pageWeston Andros Adamson1-2/+5
Add "page groups" - a circular list of nfs requests (struct nfs_page) that all reference the same page. This gives nfs read and write paths the ability to account for sub-page regions independently. This somewhat follows the design of struct buffer_head's sub-page accounting. Only "head" requests are ever added/removed from the inode list in the buffered write path. "head" and "sub" requests are treated the same through the read path and the rest of the write/commit path. Requests are given an extra reference across the life of the list. Page groups are never rejoined after being split. If the read/write request fails and the client falls back to another path (ie revert to MDS in PNFS case), the already split requests are pushed through the recoalescing code again, which may split them further and then coalesce them into properly sized requests on the wire. Fragmentation shouldn't be a problem with the current design, because we flush all requests in page group when a non-contiguous request is added, so the only time resplitting should occur is on a resend of a read or write. This patch lays the groundwork for sub-page splitting, but does not actually do any splitting. For now all page groups have one request as pg_test functions don't yet split pages. There are several related patches that are needed support multiple requests per page group. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-29nfs: remove unused arg from nfs_create_requestWeston Andros Adamson1-4/+2
@inode is passed but not used. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28NFS: Move the write verifier into the nfs_pgio_headerAnna Schumaker1-2/+2
The header had a pointer to the verifier that was set from the old write data struct. We don't need to keep the pointer around now that we have shared structures. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28nfs: remove ->read_pageio_init from rpc opsChristoph Hellwig1-1/+1
The read_pageio_init method is just a very convoluted way to grab the right nfs_pageio_ops vector. The vector to chose is not a choice of protocol version, but just a pNFS vs MDS I/O choice that can simply be done inside nfs_pageio_init_read based on the presence of a layout driver, and a new force_mds flag to the special case of falling back to MDS I/O on a pNFS-capable volume. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-28nfs: remove ->write_pageio_init from rpc opsChristoph Hellwig1-2/+2
The write_pageio_init method is just a very convoluted way to grab the right nfs_pageio_ops vector. The vector to chose is not a choice of protocol version, but just a pNFS vs MDS I/O choice that can simply be done inside nfs_pageio_init_write based on the presence of a layout driver, and a new force_mds flag to the special case of falling back to MDS I/O on a pNFS-capable volume. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-06new helper: iov_iter_get_pages_alloc()Al Viro1-202/+88
same as iov_iter_get_pages(), except that pages array is allocated (kmalloc if possible, vmalloc if that fails) and left for caller to free. Lustre and NFS ->direct_IO() switched to it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06get rid of pointless iov_length() in ->direct_IO()Al Viro1-7/+3
all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06convert the guts of nfs_direct_IO() to iov_iterAl Viro1-25/+21
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06pass iov_iter to ->direct_IO()Al Viro1-4/+4
unmodified, for now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-13nfs: page cache invalidation for dioChristoph Hellwig1-2/+27
Make sure to properly invalidate the pagecache before performing direct I/O, so that no stale pages are left around. This matches what the generic direct I/O code does. Also take the i_mutex over the direct write submission to avoid the lifelock vs truncate waiting for i_dio_count to decrease, and to avoid having the pagecache easily repopulated while direct I/O is in progrss. Again matching the generic direct I/O code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-13nfs: take i_mutex during direct I/O readsChristoph Hellwig1-2/+12
We'll need the i_mutex to prevent i_dio_count from incrementing while truncate is waiting for it to reach zero, and protects against having the pagecache repopulated after we flushed it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-13nfs: merge nfs_direct_write into nfs_file_direct_writeChristoph Hellwig1-50/+41
Simple code cleanup to prepare for later fixes. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-13nfs: merge nfs_direct_read into nfs_file_direct_readChristoph Hellwig1-58/+49
Simple code cleanup to prepare for later fixes. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-13nfs: increment i_dio_count for reads, tooChristoph Hellwig1-3/+6
i_dio_count is used to protect dio access against truncate. We want to make sure there are no dio reads pending either when doing a truncate. I suspect on plain NFS things might work even without this, but once we use a pnfs layout driver that access backing devices directly things will go bad without the proper synchronization. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-13nfs: defer inode_dio_done call until size update is doneChristoph Hellwig1-17/+15
We need to have the I/O fully finished before telling the truncate code that we are done. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-13nfs: fix size updates for aio writesChristoph Hellwig1-5/+16
nfs_file_direct_write only updates the inode size if it succeeded and returned the number of bytes written. But in the AIO case nfs_direct_wait turns the return value into -EIOCBQUEUED and we skip the size update. Instead the aio completion path should updated it, which this patch does. The implementation is a little hacky because there is no obvious way to find out we are called for a write in nfs_direct_complete. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-05NFS: dprintk() should not print negative fileids and inode numbersNiels de Vos1-2/+2
A fileid in NFS is a uint64. There are some occurrences where dprintk() outputs a signed fileid. This leads to confusion and more difficult to read debugging (negative fileids matching positive inode numbers). Signed-off-by: Niels de Vos <ndevos@redhat.com> CC: Santosh Pradhan <spradhan@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2013-10-24nfs: use %p[dD] instead of open-coded (and often racy) equivalentsAl Viro1-11/+6
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-30aio: Kill aio_rw_vect_retry()Kent Overstreet1-1/+0
This code doesn't serve any purpose anymore, since the aio retry infrastructure has been removed. This change should be safe because aio_read/write are also used for synchronous IO, and called from do_sync_read()/do_sync_write() - and there's no looping done in the sync case (the read and write syscalls). Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2012-12-12nfs: fix page dirtying in NFS DIO read codepathJeff Layton1-7/+2
The NFS DIO code will dirty pages that catch read responses in order to handle the case where someone is doing DIO reads into an mmapped buffer. The existing code doesn't really do the right thing though since it doesn't take into account the case where we might be attempting to read past the EOF. Fix the logic in that code to only dirty pages that ended up receiving data from the read. Note too that it really doesn't matter if NFS_IOHDR_ERROR is set or not. All that matters is if the page was altered by the read. Cc: Fred Isaman <iisaman@netapp.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-12nfs: don't zero out the rest of the page if we hit the EOF on a DIO READJeff Layton1-8/+0
Eryu provided a test program that would segfault when attempting to read past the EOF on file that was opened O_DIRECT. The buffer given to the read() call was on the stack, and when he attempted to read past it it would scribble over the rest of the stack page. If we hit the end of the file on a DIO READ request, then we don't want to zero out the rest of the buffer. These aren't pagecache pages after all, and there's no guarantee that the buffers that were passed in represent entire pages. Cc: <stable@vger.kernel.org> # v3.5+ Cc: Fred Isaman <iisaman@netapp.com> Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-08NFS41: send real write size in layoutgetPeng Tao1-0/+7
For buffer write, block layout client scan inode mapping to find next hole and use offset-to-hole as layoutget length. Object layout client uses offset-to-isize as layoutget length. For direct write, both block layout and object layout use dreq->bytes_left. Signed-off-by: Peng Tao <tao.peng@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-08NFS: track direct IO left bytesPeng Tao1-0/+5
Signed-off-by: Peng Tao <tao.peng@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01NFSv41: fix DIO write_io calculationPeng Tao1-2/+2
pnfs_within_mdsthreshold() is called inside pg_init. We need to set read_io/write_io before that. Otherwise we fail pnfs_within_mdsthreshold() and IO goes to MDS. A simple test case: dd if=foo of=/mnt/pnfs/bar bs=10M count=1 oflag=direct Signed-off-by: Peng Tao <tao.peng@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28NFS: Convert nfs_get_lock_context to return an ERR_PTR on failureTrond Myklebust1-4/+12
We want to be able to distinguish between allocation failures, and the case where the lock context is not needed (because there are no locks). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-31Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds1-28/+54
Merge Andrew's second set of patches: - MM - a few random fixes - a couple of RTC leftovers * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits) rtc/rtc-88pm80x: remove unneed devm_kfree rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables tmpfs: distribute interleave better across nodes mm: remove redundant initialization mm: warn if pg_data_t isn't initialized with zero mips: zero out pg_data_t when it's allocated memcg: gix memory accounting scalability in shrink_page_list mm/sparse: remove index_init_lock mm/sparse: more checks on mem_section number mm/sparse: optimize sparse_index_alloc memcg: add mem_cgroup_from_css() helper memcg: further prevent OOM with too many dirty pages memcg: prevent OOM with too many dirty pages mm: mmu_notifier: fix freed page still mapped in secondary MMU mm: memcg: only check anon swapin page charges for swap cache mm: memcg: only check swap cache pages for repeated charging mm: memcg: split swapin charge function into private and public part mm: memcg: remove needless !mm fixup to init_mm when charging mm: memcg: remove unneeded shmem charge type ...
2012-07-31nfs: enable swap on NFSMel Gorman1-28/+54
Implement the new swapfile a_ops for NFS and hook up ->direct_IO. This will set the NFS socket to SOCK_MEMALLOC and run socket reconnect under PF_MEMALLOC as well as reset SOCK_MEMALLOC before engaging the protocol ->connect() method. PF_MEMALLOC should allow the allocation of struct socket and related objects and the early (re)setting of SOCK_MEMALLOC should allow us to receive the packets required for the TCP connection buildup. [jlayton@redhat.com: Restore PF_MEMALLOC task flags in all cases] [dfeng@redhat.com: Fix handling of multiple swap files] [a.p.zijlstra@chello.nl: Original patch] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30NFS: Convert v4 into a moduleBryan Schumaker1-1/+1
This patch exports symbols needed by the v4 module. In addition, I also switch over to using IS_ENABLED() to check if CONFIG_NFS_V4 or CONFIG_NFS_V4_MODULE are set. The module (nfs4.ko) will be created in the same directory as nfs.ko and will be automatically loaded the first time you try to mount over NFS v4. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30NFS: Convert v3 into a moduleBryan Schumaker1-1/+1
This patch exports symbols and moves over the final structures needed by the v3 module. In addition, I also switch over to using IS_ENABLED() to check if CONFIG_NFS_V3 or CONFIG_NFS_V3_MODULE are set. The module (nfs3.ko) will be created in the same directory as nfs.ko and will be automatically loaded the first time you try to mount over NFS v3. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30NFS: fix pnfs regression with directio writesFred Isaman1-2/+2
Commit 57208fa7e51 "NFS: Create an write_pageio_init() function" did not modify the calls in direct.c, preventing direct io from using pnfs. This reintroduces that capability. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30NFS: fix pnfs regression with directio readsFred Isaman1-1/+1
Commit 1abb50886af "NFS: Create an read_pageio_init() function" did not modify the call in direct.c, preventing direct io from using pnfs. This reintroduces that capability. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-08NFS: Fix list manipulation snafus in fs/nfs/direct.cTrond Myklebust1-1/+5
Fix 2 bugs in nfs_direct_write_reschedule: - The request needs to be removed from the 'reqs' list before it can be added to 'failed'. - Fix an infinite loop if the 'failed' list is non-empty. Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-06-19NFS: Fix a refcounting issue in O_DIRECTTrond Myklebust1-0/+1
In nfs_direct_write_reschedule(), the requests from nfs_scan_commit_list have a refcount of 2, whereas the operations in nfs_direct_write_completion_ops expect them to have a refcount of 1. This patch adds a call to release the extra references. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com>
2012-06-15Merge tag 'nfs-for-3.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-4/+4
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: - Fix a couple of mount regressions due to the recent cleanups. - Fix an Oops in the open recovery code - Fix an rpc_pipefs upcall hang that results from some of the net namespace work from 3.4.x (stable kernel candidate). - Fix a couple of write and o_direct regressions that were found at last weeks Bakeathon testing event in Ann Arbor." * tag 'nfs-for-3.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: add an endian notation for sparse NFSv4.1: integer overflow in decode_cb_sequence_args() rpc_pipefs: allow rpc_purge_list to take a NULL waitq pointer NFSv4 do not send an empty SETATTR compound NFSv2: EOF incorrectly set on short read NFS: Use the NFS_DEFAULT_VERSION for v2 and v3 mounts NFS: fix directio refcount bug on commit NFSv4: Fix unnecessary delegation returns in nfs4_do_open NFSv4.1: Convert another trivial printk into a dprintk NFS4: Fix open bug when pnfs module blacklisted NFS: Remove incorrect BUG_ON in nfs_found_client NFS: Map minor mismatch error to protocol not support error. NFS: Fix a commit bug NFS4: Set parsed mount data version to 4 NFSv4.1: Ensure we clear session state flags after a session creation NFSv4.1: Convert a trivial printk into a dprintk NFSv4: Fix up decode_attr_mdsthreshold NFSv4: Fix an Oops in the open recovery code NFSv4.1: Fix a request leak on the back channel
2012-06-09NFS: fix directio refcount bug on commitFred Isaman1-2/+2
This reverts a hunk from commit 04277086577 "NFS: Clean up - Simplify reference counting in fs/nfs/direct.c" The cleanups in that patch affect the write path, but by the time processing hits commit the removed reference has been added back by nfs_scan_commit_list(). Without this reversion, any page that is sent to commit holds on to an unbalanced reference that is never freed. The immediate effect is an imbalance over the wire between OPENs and CLOSEs. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-06-05NFS: Fix a commit bugTrond Myklebust1-2/+2
The new commit code fails to copy the verifier into the wb_verf field of _all_ the nfs_page structures; it only copies it into the first entry. The consequence is that most requests end up failing to match in nfs_commit_release. Fix is to copy the verifier into the req->wb_verf field in nfs_write_completion. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com>
2012-05-31NFS: Ensure that setattr and getattr wait for O_DIRECT write completionTrond Myklebust1-3/+12
Use the same mechanism as the block devices are using, but move the helper functions from fs/direct-io.c into fs/inode.c to remove the dependency on CONFIG_BLOCK. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Fred Isaman <iisaman@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-24NFSv4.1 add nfs_inode book keeping for mdsthresholdAndy Adamson1-0/+2
Keep track of the number of bytes read or written via buffered, direct, and mem-mapped i/o for use by mdsthreshold size_io hints. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-05-09NFS: Clean up - Simplify reference counting in fs/nfs/direct.cTrond Myklebust1-11/+4
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com>
2012-05-09NFS: Clean up - Rename nfs_unlock_request and nfs_unlock_request_dont_releaseTrond Myklebust1-5/+5
Function rename to ensure that the functionality of nfs_unlock_request() mirrors that of nfs_lock_request(). Then let nfs_unlock_and_release_request() do the work of what used to be called nfs_unlock_request()... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com>
2012-05-09NFS: Clean up - simplify nfs_lock_request()Trond Myklebust1-0/+1
We only have two places where we need to grab a reference when trying to lock the nfs_page. We're better off making that explicit. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com>
2012-05-04NFS: Fix sparse warningsTrond Myklebust1-1/+1
Fix the following sparse warnings: fs/nfs/direct.c:221:6: warning: symbol 'nfs_direct_readpage_release' was not declared. Should it be static? fs/nfs/read.c:38:43: warning: non-ANSI function declaration of function 'nfs_readhdr_alloc' fs/nfs/objlayout/objio_osd.c:214:5: warning: symbol '__alloc_objio_seg' was not declared. Should it be static? Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com> Cc: Boaz Harrosh <bharrosh@panasas.com>
2012-05-04NFS: Fix O_DIRECT compile warningsTrond Myklebust1-4/+4
Fix the following compile warnings: fs/nfs/direct.c: In function 'nfs_direct_read_schedule_segment': fs/nfs/direct.c:325:11: warning: comparison of distinct pointer types lacks a cast [enabled by default] fs/nfs/direct.c:325:11: warning: comparison of distinct pointer types lacks a cast [enabled by default] fs/nfs/direct.c:325:11: warning: comparison of distinct pointer types lacks a cast [enabled by default] fs/nfs/direct.c:352:27: warning: comparison of distinct pointer types lacks a cast [enabled by default] fs/nfs/direct.c: In function 'nfs_direct_write_schedule_segment': fs/nfs/direct.c:622:11: warning: comparison of distinct pointer types lacks a cast [enabled by default] fs/nfs/direct.c:622:11: warning: comparison of distinct pointer types lacks a cast [enabled by default] fs/nfs/direct.c:622:11: warning: comparison of distinct pointer types lacks a cast [enabled by default] fs/nfs/direct.c:650:27: warning: comparison of distinct pointer types lacks a cast [enabled by default] Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
2012-05-01NFS: Simplify the nfs_read_completion functionsTrond Myklebust1-28/+20
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com>
2012-04-30NFS: Use kmem_cache_zalloc() in nfs_direct_req_allocTrond Myklebust1-11/+1
Simplify the initialisation of O_DIRECT requests. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com>
2012-04-30NFS: Simplify O_DIRECT page referencingTrond Myklebust1-16/+6
The O_DIRECT code shouldn't need to hold 2 references to each page. The reference held by the struct nfs_page should suffice. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com>
2012-04-30NFS: O_DIRECT pgio_completion_ops error_cleanup must unlock the requestTrond Myklebust1-3/+15
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com>
2012-04-30NFS: Ensure that we break out of read/write_schedule_segment on errorTrond Myklebust1-2/+3
Currently we do break out of the for() loop, but we also need to break out of the enclosing do {} while()... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com>
2012-04-30NFS: Define nfs_direct_write_schedule_work() when v3 and v4 are disabledBryan Schumaker1-0/+3
v2 doesn't have commits, so this function can be a no-op. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-04-27NFS: rewrite directio write to use async coalesce codeFred Isaman1-297/+230
This also has the advantage that it allows directio to use pnfs. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-04-27NFS: rewrite directio read to use async coalesce codeFred Isaman1-132/+123
This also has the advantage that it allows directio to use pnfs. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-04-27NFS: merge _full and _partial write rpc_opsFred Isaman1-2/+8
Decouple nfs_pgio_header and nfs_write_data, and have (possibly multiple) nfs_write_datas each take a refcount on nfs_pgio_header. For the moment keeps nfs_write_header as a way to preallocate a single nfs_write_data with the nfs_pgio_header. The code doesn't need this, and would be prettier without, but given the amount of churn I am already introducing I didn't want to play with tuning new mempools. This also fixes bug in pnfs_ld_handle_write_error. In the case of desc->pg_bsize < PAGE_CACHE_SIZE, the pages list was empty, causing replay attempt to do nothing. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-04-27NFS: merge _full and _partial read rpc_opsFred Isaman1-2/+8
Decouple nfs_pgio_header and nfs_read_data, and have (possibly multiple) nfs_read_datas each take a refcount on nfs_pgio_header. For the moment keeps nfs_read_header as a way to preallocate a single nfs_read_data with the nfs_pgio_header. The code doesn't need this, and would be prettier without, but given the amount of churn I am already introducing I didn't want to play with tuning new mempools. This also fixes bug in pnfs_ld_handle_read_error. In the case of desc->pg_bsize < PAGE_CACHE_SIZE, the pages list was empty, causing replay attempt to do nothing. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-04-27NFS: create struct nfs_page_arrayFred Isaman1-17/+23
Both nfs_read_data and nfs_write_data devote several fields which can be combined into a single shared struct. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-04-27NFS: create common nfs_pgio_header for both read and writeFred Isaman1-23/+50
In order to avoid duplicating all the data in nfs_read_data whenever we split it up into multiple RPC calls (either due to a short read result or due to rsize < PAGE_SIZE), we split out the bits that are the same per RPC call into a separate "header" structure. The goal this patch moves towards is to have a single header refcounted by several rpc_data structures. Thus, want to always refer from rpc_data to the header, and not the other way. This patch comes close to that ideal, but the directio code currently needs some special casing, isolated in the nfs_direct_[read_write]hdr_release() functions. This will be dealt with in a future patch. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-04-27NFS: dprintks in directio code were referencing task after putFred Isaman1-4/+4
Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-04-27NFS: add a struct nfs_commit_data to replace nfs_write_data in commitsFred Isaman1-10/+7
Commits don't need the vectors of pages, etc. that writes do. Split out a separate structure for the commit operation. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-04-27NFS: grab open context in direct readFred Isaman1-2/+2
Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-03-28Merge tag 'split-asm_system_h-for-linus-20120328' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system Pull "Disintegrate and delete asm/system.h" from David Howells: "Here are a bunch of patches to disintegrate asm/system.h into a set of separate bits to relieve the problem of circular inclusion dependencies. I've built all the working defconfigs from all the arches that I can and made sure that they don't break. The reason for these patches is that I recently encountered a circular dependency problem that came about when I produced some patches to optimise get_order() by rewriting it to use ilog2(). This uses bitops - and on the SH arch asm/bitops.h drags in asm-generic/get_order.h by a circuituous route involving asm/system.h. The main difficulty seems to be asm/system.h. It holds a number of low level bits with no/few dependencies that are commonly used (eg. memory barriers) and a number of bits with more dependencies that aren't used in many places (eg. switch_to()). These patches break asm/system.h up into the following core pieces: (1) asm/barrier.h Move memory barriers here. This already done for MIPS and Alpha. (2) asm/switch_to.h Move switch_to() and related stuff here. (3) asm/exec.h Move arch_align_stack() here. Other process execution related bits could perhaps go here from asm/processor.h. (4) asm/cmpxchg.h Move xchg() and cmpxchg() here as they're full word atomic ops and frequently used by atomic_xchg() and atomic_cmpxchg(). (5) asm/bug.h Move die() and related bits. (6) asm/auxvec.h Move AT_VECTOR_SIZE_ARCH here. Other arch headers are created as needed on a per-arch basis." Fixed up some conflicts from other header file cleanups and moving code around that has happened in the meantime, so David's testing is somewhat weakened by that. We'll find out anything that got broken and fix it.. * tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits) Delete all instances of asm/system.h Remove all #inclusions of asm/system.h Add #includes needed to permit the removal of asm/system.h Move all declarations of free_initmem() to linux/mm.h Disintegrate asm/system.h for OpenRISC Split arch_align_stack() out from asm-generic/system.h Split the switch_to() wrapper out of asm-generic/system.h Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h Create asm-generic/barrier.h Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h Disintegrate asm/system.h for Xtensa Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt] Disintegrate asm/system.h for Tile Disintegrate asm/system.h for Sparc Disintegrate asm/system.h for SH Disintegrate asm/system.h for Score Disintegrate asm/system.h for S390 Disintegrate asm/system.h for PowerPC Disintegrate asm/system.h for PA-RISC Disintegrate asm/system.h for MN10300 ...
2012-03-28Remove all #inclusions of asm/system.hDavid Howells1-1/+0
Remove all #inclusions of asm/system.h preparatory to splitting and killing it. Performed with the following command: perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *` Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-21NFS: Remove nfs4_setup_sequence from generic read codeBryan Schumaker1-2/+0
This is an NFS v4 specific operation, so it belongs in the NFS v4 code and not the generic client. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-03-21NFS: Remove nfs4_setup_sequence from generic write codeBryan Schumaker1-4/+0
This is an NFS v4 specific operation, so it belongs in the NFS v4 code and not the generic client. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-07-26atomic: use <linux/atomic.h>Arun Sharma1-1/+1
This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-20nfs_open_context doesn't need struct path eitherAl Viro1-2/+2
just dentry, please... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-11NFS: account direct-io into task io accountingKonstantin Khlebnikov1-0/+5
Account NFS direct-io reads and writes into Task I/O Accounting. Do it before complition to handle aio. NFS have unusual direct-io implementation, thus accounting in generic code does not work. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-03-11NFS: remove pointless if statement in nfs_direct_write_resultFred Isaman1-2/+1
The code was doing nothing more in either branch of the if. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-25NFS: Fix "kernel BUG at fs/aio.c:554!"Chuck Lever1-14/+20
Nick Piggin reports: > I'm getting use after frees in aio code in NFS > > [ 2703.396766] Call Trace: > [ 2703.396858] [<ffffffff8100b057>] ? native_sched_clock+0x27/0x80 > [ 2703.396959] [<ffffffff8108509e>] ? put_lock_stats+0xe/0x40 > [ 2703.397058] [<ffffffff81088348>] ? lock_release_holdtime+0xa8/0x140 > [ 2703.397159] [<ffffffff8108a2a5>] lock_acquire+0x95/0x1b0 > [ 2703.397260] [<ffffffff811627db>] ? aio_put_req+0x2b/0x60 > [ 2703.397361] [<ffffffff81039701>] ? get_parent_ip+0x11/0x50 > [ 2703.397464] [<ffffffff81612a31>] _raw_spin_lock_irq+0x41/0x80 > [ 2703.397564] [<ffffffff811627db>] ? aio_put_req+0x2b/0x60 > [ 2703.397662] [<ffffffff811627db>] aio_put_req+0x2b/0x60 > [ 2703.397761] [<ffffffff811647fe>] do_io_submit+0x2be/0x7c0 > [ 2703.397895] [<ffffffff81164d0b>] sys_io_submit+0xb/0x10 > [ 2703.397995] [<ffffffff8100307b>] system_call_fastpath+0x16/0x1b > > Adding some tracing, it is due to nfs completing the request then > returning something other than -EIOCBQUEUED, so aio.c > also completes the request. To address this, prevent the NFS direct I/O engine from completing async iocbs when the forward path returns an error without starting any I/O. This fix appears to survive ^C during both "xfstest no. 208" and "fsx -Z." It's likely this bug has existed for a very long while, as we are seeing very similar symptoms in OEL 5. Copying stable. Cc: Stable <stable@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-11-22Pure nfs client performance using odirect.Arun Bharadwaj1-1/+1
When an application opens a file with O_DIRECT flag, if the size of the data that is written is equal to wsize, the client sends a WRITE RPC with stable flag set to UNSTABLE followed by a single COMMIT RPC rather than sending a single WRITE RPC with the stable flag set to FILE_SYNC. This a bug. Patch to fix this. Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-10-28Fixed Regression in NFS Direct I/O pathSteve Dickson1-1/+1
A typo, introduced by commit f11ac8db, in the nfs_direct_write() routine causes writes with O_DIRECT set to fail with a ENOMEM error. Found-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve Dickson <steved@redhat.com> Cc: stable@kernel.org Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-07-30NFSv4: Ensure that we track the NFSv4 lock state in read/write requests.Trond Myklebust1-7/+22
This patch fixes bugzilla entry 14501: https://bugzilla.kernel.org/show_bug.cgi?id=14501 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>