aboutsummaryrefslogtreecommitdiffstats
path: root/fs/dlm
AgeCommit message (Collapse)AuthorFilesLines
2024-04-23dlm: return -ENOMEM if ls_recover_buf failsAlexander Aring1-1/+3
This patch fixes to return -ENOMEM in case of an allocation failure that was forgotten to change in commit 6c648035cbe7 ("dlm: switch to use rhashtable for rsbs"). Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202404200536.jGi6052v-lkp@intel.com/ Fixes: 6c648035cbe7 ("dlm: switch to use rhashtable for rsbs") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-17dlm: fix sleep in atomic contextAlexander Aring3-8/+8
This patch changes the orphans mutex to a spinlock since commit c288745f1d4a ("dlm: avoid blocking receive at the end of recovery") is using a rwlock_t to lock the DLM message receive path and do_purge() can be called while this lock is held that forbids to sleep. We need to use spin_lock_bh() because also a user context that calls dlm_user_purge() can call do_purge() and since commit 92d59adfaf71 ("dlm: do message processing in softirq context") the DLM message receive path is done under softirq context. Fixes: c288745f1d4a ("dlm: avoid blocking receive at the end of recovery") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/gfs2/9ad928eb-2ece-4ad9-a79c-d2bce228e4bc@moroto.mountain/ Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: use rwlock for lkbidrAlexander Aring3-41/+11
Convert the lock for lkbidr to an rwlock. Most idr lookups will use the read lock. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: use rwlock for rsb hash tableAlexander Aring7-87/+206
The conversion to rhashtable introduced a hash table lock per lockspace, in place of per bucket locks. To make this more scalable, switch to using a rwlock for hash table access. The common case fast path uses it as a read lock. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: drop dlm_scand kthread and use timersAlexander Aring7-239/+283
Currently the scand kthread acts like a garbage collection for expired rsbs on toss list, to clean them up after a certain timeout. It triggers every couple of seconds and iterates over the toss list while holding ls_rsbtbl_lock for the whole hash bucket iteration. To reduce the amount of time holding ls_rsbtbl_lock, we now handle the disposal of expired rsbs using a per-lockspace timer that expires for the earliest tossed rsb on the lockspace toss queue. This toss queue is ordered according to the rsb res_toss_time with the earliest tossed rsb as the first entry. The toss timer will only trylock() necessary locks, since it is low priority garbage collection, and will rearm the timer if trylock() fails. If the timer function does not find any expired rsb's, it rearms the timer with the next earliest expired rsb. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: do not use ref counts for rsb in the toss stateAlexander Aring3-32/+32
In the past we had problems when an rsb had a reference counter greater than one while in the toss state. An rsb in the toss state is not actively used for locking, and should not have any other references apart from the single ref keeping it on the rsb hash. Shift to freeing rsb's directly rather than using kref_put to free them, since the ref counting is not meant to be used in this state. Add warnings if ref counting is seen while an rsb is in the toss state. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: switch to use rhashtable for rsbsAlexander Aring8-160/+86
Replace our own hash table with the more advanced rhashtable for keeping rsb structs. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: add rsb lists for iterationAlexander Aring6-261/+84
To prepare for using rhashtable, add two rsb lists for iterating through rsb's in two uncommon cases where this is necesssary: - when dumping rsb state from debugfs, now using seq_list. - when looking at all rsb's during recovery. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: merge toss and keep hash table lists into one listAlexander Aring7-76/+98
There are several places where lock processing can perform two hash table lookups, first in the "keep" list, and if not found, in the "toss" list. This patch introduces a new rsb state flag "RSB_TOSS" to represent the difference between the state of being on keep vs toss list, so that the two lists can be combined. This avoids cases of two lookups. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: change to single hashtable lockAlexander Aring7-61/+60
Prepare to replace our own hash table with rhashtable by replacing the per-bucket locks in our own hash table with a single lock. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-16dlm: increment ls_count for dlm_scandAlexander Aring1-0/+3
Increment the ls_count value while dlm_scand is processing a lockspace so that release_lockspace()/remove_lockspace() will wait for dlm_scand to finish. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09dlm: do message processing in softirq contextAlexander Aring1-8/+20
Move dlm message processing from an ordered workqueue context to an ordered softirq context. Handling dlm messages in softirq will allow requests to be cleared more quickly and efficiently, and should avoid longer queues of incomplete requests. Later patches are expected to run completion/blocking callbacks directly from this message processing context, further reducing context switches required to complete a request. In the longer term, concurrent message processing could be implemented. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09dlm: use spin_lock_bh for message processingAlexander Aring14-258/+287
Use spin_lock_bh for all spinlocks involved in message processing, in preparation for softirq message processing. DLM lock requests from user space involve dlm processing in user context, in addition to the standard kernel context, necessitating bh variants. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09dlm: remove schedule in receive pathAlexander Aring1-1/+0
Remove an explicit schedule() call in the message processing path, in preparation for softirq message processing. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09dlm: convert ls_recv_active from rw_semaphore to rwlockAlexander Aring5-8/+8
Convert ls_recv_active rw_semaphore to an rwlock to avoid sleeping, in preparation for softirq message processing. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09dlm: avoid blocking receive at the end of recoveryAlexander Aring5-41/+30
The end of the recovery process transitioned to normal message processing by temporarily blocking the receiving context, processing saved messages, then unblocking the receiving context. To avoid blocking the receiving context, the old wait_queue and mutex are replaced by a new rwlock and the new RECV_MSG_BLOCKED flag. Received messages are added to the list of saved messages, protected by the rwlock, until the flag is cleared, which happens when all saved messages have been processed. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09dlm: convert res_lock to spinlockAlexander Aring3-4/+4
Convert the rsb struct res_lock from a mutex to a spinlock in preparation for processing messages in softirq context. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09dlm: convert ls_waiters_mutex to spinlockAlexander Aring4-14/+14
Convert the waiters mutex to a spinlock in prepration for processing messages in softirq context. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09dlm: drop mutex use in waiters recoveryAlexander Aring3-8/+23
The waiters_mutex no longer needs to be used in the waiters recovery functions dlm_recover_waiters_pre() and dlm_recover_waiters_pre(). During recovery, ordinary locking operations are paused, and the recovery thread is the only context accessing the waiters list, so the lock is not needed. Access to the waiters list from debugfs functions is avoided by taking the top level recovery lock in the debugfs dump function. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09dlm: add new struct to save position in dlm_copy_master_namesAlexander Aring4-10/+116
Add a new struct to save the current position in the rsb masters_list while sending the rsb names to other nodes. The rsb names are sent in multiple chunks, and for each new chunk, the new "dlm_dir_dump" struct saves the last position in the masters_list. The new struct is also used to save more information to sanity check the recovery process. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09dlm: move rsb root_list to ls_recover() stackAlexander Aring9-70/+47
Move the rsb root_list from the lockspace to a stack variable since it is now only used by the ls_recover() function. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09dlm: use a new list for recovery of master rsb namesAlexander Aring5-14/+79
Add a new "masters_list" for master rsb structs, with a new rwlock. The new list is created and used during the recovery process to send the master rsb names to new nodes. With this change, the current "root_list" can be used without locking. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09dlm: move root_list functionality to recover.cAlexander Aring3-44/+39
Move dlm_create_root_list() and dlm_release_root_list() to recover.c and declare them static because they are only used there. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09dlm: switch to GFP_ATOMIC in dlm allocationsAlexander Aring4-8/+4
Replace GFP_NOFS with GFP_ATOMIC. Also stop using idr_preload which uses a non-bh spin_lock. This is further preparation for softirq message processing. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-09dlm: remove allocation parameter in msg allocationAlexander Aring8-57/+41
Remove the context parameter for message allocations and always use GFP_ATOMIC. This prepares for softirq message processing. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-02dlm: Simplify the allocation of slab caches in dlm_lowcomms_msg_cache_createKunwu Chan1-1/+1
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Acked-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01dlm: remove callback reference countingAlexander Aring7-55/+28
Get rid of the unnecessary refcounting on callback structs. Copy interesting callback info into the lkb struct rather than maintaining pointers to callback structs from the lkb. This goes back to the way things were done prior to commit 61bed0baa4db ("fs: dlm: use a non-static queue for callbacks"). Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01dlm: fix race between final callback and removeAlexander Aring5-213/+129
This patch fixes the following issue: node 1 is dir node 2 is master node 3 is other 1->2: unlock 2: put final lkb, rsb moved to toss 2->1: unlock_reply 1: queue lkb callback with EUNLOCK 2->1: remove 1: receive_remove ignored (rsb on keep because of queued lkb callback) 1: complete lkb callback, put_lkb, move rsb to toss 3->1: lookup 1->3: lookup_reply master=2 3->2: request 2->3: request_reply EBADR In summary: An unexpected lkb reference causes the rsb to remain on the wrong list. The rsb being on the wrong list causes receive_remove to be ignored. An ignored receive_remove causes inconsistent dir and master state. This sequence requires an unusually long delay in delivering the unlock callback, because the remove message from 2->1 usually happens after some seconds. So, it's not known exactly how frequently this sequence occurs in pratice. It's possible that the same end result could also have another unknown cause. The solution for this issue is to further separate callback state from the lkb, so that an lkb reference (and from that, an rsb ref) are not held while a callback remains queued. Then, within the unlock_reply, the lkb will be freed and the rsb moved to the toss list. So, the receive_remove will not be ignored. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01dlm: combine switch case fail and default statementsAlexander Aring2-8/+6
This patch combines the failure and default cases for enqueue and dequeue a callback to the lkb callback queue that should end in both cases as it should never happen. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01dlm: save callback debug info earlierAlexander Aring1-4/+6
Save lkb callback info when queueing the callback so that the lkb struct is not needed in the callback workqueue processing. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01dlm: remove callback queue debugfs functionalityAlexander Aring1-96/+0
Remove the ability to dump pending lkb callbacks from debugfs. The prepares for separating lkb structs from callbacks. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01dlm: remove lkb from callback tracepointsAlexander Aring2-4/+14
Stop using lkb structs in the callback tracepoints so that lkb references are not needed. This prepares for separating lkb structs from callbacks. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01dlm: Simplify the allocation of slab caches in dlm_midcomms_cache_createKunwu Chan1-2/+1
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-04-01dlm: fix user space lock decision to copy lvbAlexander Aring3-13/+17
This patch fixes the copy lvb decision for user space lock requests. Checking dlm_lvb_operations is done earlier, where granted/requested lock modes are available to use in the matrix. The decision had been moved to the wrong location, where granted mode and requested mode where the same, which causes the dlm_lvb_operations matix to produce the wrong copy decision. For PW or EX requests, the caller could get invalid lvb data. Fixes: 61bed0baa4db ("fs: dlm: use a non-static queue for callbacks") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-03-18Merge tag 'dlm-6.9' of ↵Linus Torvalds3-39/+81
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: - Fix mistaken variable assignment that caused a refcounting problem - Revert a recent change that began using atomic counters where they were not needed (for lkb wait_count) - Add comments around forced state reset for waiting lock operations during recovery * tag 'dlm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: add comments about forced waiters reset dlm: revert atomic_t lkb_wait_count dlm: fix user space lkb refcounting
2024-03-15dlm: add comments about forced waiters resetDavid Teigland1-20/+58
When a lock is waiting for a reply for a remote operation, and recovery interrupts this "waiters" state, the remote operation is voided by the recovery, and no reply will be processed. The lkb waiters state for the remote operation is forcibly reset/cleared, so that the lock operation can be restarted after recovery. Improve the comments describing this. Signed-off-by: David Teigland <teigland@redhat.com>
2024-03-15dlm: revert atomic_t lkb_wait_countDavid Teigland2-15/+19
Revert "fs: dlm: handle lkb wait count as atomic_t" This reverts commit 75a7d60134ce84209f2c61ec4619ee543aa8f466. This counter does not need to be atomic. As the comment in the reverted commit mentions, the counter is protected by the rsb lock. Signed-off-by: David Teigland <teigland@redhat.com>
2024-03-12dlm: fix user space lkb refcountingAlexander Aring1-5/+5
This patch fixes to check on the right return value if it was the last callback. The rv variable got overwritten by the return of copy_result_to_user(). Fixing it by introducing a second variable for the return value and don't let rv being overwritten. Cc: stable@vger.kernel.org Fixes: 61bed0baa4db ("fs: dlm: use a non-static queue for callbacks") Reported-by: Valentin Vidić <vvidic@valentin-vidic.from.hr> Closes: https://lore.kernel.org/gfs2/Ze4qSvzGJDt5yxC3@valentin-vidic.from.hr Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2024-02-05dlm: adapt to breakup of struct file_lockJeff Layton1-23/+22
Most of the existing APIs have remained the same, but subsystems that access file_lock fields directly need to reach into struct file_lock_core now. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240131-flsplit-v3-37-c6129007ee8d@kernel.org Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05filelock: split common fields into struct file_lock_coreJeff Layton1-0/+1
In a future patch, we're going to split file leases into their own structure. Since a lot of the underlying machinery uses the same fields move those into a new file_lock_core, and embed that inside struct file_lock. For now, add some macros to ensure that we can continue to build while the conversion is in progress. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240131-flsplit-v3-17-c6129007ee8d@kernel.org Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05dlm: convert to using new filelock helpersJeff Layton1-5/+5
Convert to using the new file locking helper functions. Also, in later patches we're going to introduce some temporary macros with names that clash with the variable name in dlm_posix_unlock. Rename it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240131-flsplit-v3-8-c6129007ee8d@kernel.org Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-20dlm: update format header reflect current formatAlexander Aring1-2/+2
Over the time the dlm debugfs format string has been changed but the header wasn't updated. This patch changes the first line dump header and their meaning to reflect the current formats. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-12-20dlm: fix format seq ops type 4Alexander Aring1-1/+1
This patch fixes to set the type 4 format ops in case of table_open4(). It got accidentially changed by commit 541adb0d4d10 ("fs: dlm: debugfs for queued callbacks") and since them toss debug dumps the same format as format 5 that are the queued ast callbacks for lkbs. Fixes: 541adb0d4d10 ("fs: dlm: debugfs for queued callbacks") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-11-16dlm: use FL_SLEEP to determine blocking vs non-blockingAlexander Aring1-1/+1
This patch uses the FL_SLEEP flag in struct file_lock to determine if the lock request is a blocking or non-blocking request. Before dlm was using IS_SETLKW() was being used which is not usable for lock requests coming from lockd when EXPORT_OP_SAFE_ASYNC_LOCK inside the export flags is set. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-11-16dlm: use fl_owner from lockdAlexander Aring1-14/+4
This patch is changing the fl_owner value in case of an nfs lock request to not be the pid of lockd. Instead this patch changes it to be the owner value that nfs is giving us. Currently there exists proved problems with this behaviour. One nfsd server was created to export a gfs2 filesystem mount. Two nfs clients doing a nfs mount of this export. Those two clients should conflict each other operating on the same nfs file. A small test program was written: int main(int argc, const char *argv[]) { struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET, .l_start = 1L, .l_len = 1L, }; int fd; fd = open("filename", O_RDWR | O_CREAT, 0700); printf("try to lock...\n"); fcntl(fd, F_SETLKW, &fl); printf("locked!\n"); getc(stdin); return 0; } Running on both clients at the same time and don't interrupting by pressing any key. It will show that both clients are able to acquire the lock which shouldn't be the case. The issue is here that the fl_owner value is the same and the lock context of both clients should be separated. This patch lets lockd define how to deal with lock contexts and chose hopefully the right fl_owner value. A test after this patch was made and the locks conflicts each other which should be the case. Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-11-16dlm: use kernel_connect() and kernel_bind()Jordan Rife1-7/+7
Recent changes to kernel_connect() and kernel_bind() ensure that callers are insulated from changes to the address parameter made by BPF SOCK_ADDR hooks. This patch wraps direct calls to ops->connect() and ops->bind() with kernel_connect() and kernel_bind() to protect callers in such cases. Link: https://lore.kernel.org/netdev/9944248dba1bce861375fcce9de663934d933ba9.camel@redhat.com/ Fixes: d74bad4e74ee ("bpf: Hooks for sys_connect") Fixes: 4fbac77d2d09 ("bpf: Hooks for sys_bind") Cc: stable@vger.kernel.org Signed-off-by: Jordan Rife <jrife@google.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-10-12dlm: slow down filling up processing queueAlexander Aring1-0/+12
If there is a burst of message the receive worker will filling up the processing queue but where are too slow to process dlm messages. This patch will slow down the receiver worker to keep the buffer on the socket layer to tell the sender to backoff. This is done by a threshold to get the next buffers from the socket after all messages were processed done by a flush_workqueue(). This however only occurs when we have a message burst when we e.g. create 1 million locks. If we put more and more new messages to process in the processqueue we will soon run out of memory. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-10-12dlm: fix no ack after final messageAlexander Aring1-3/+3
In case of an final DLM message we can't should not send an ack out after the final message. This patch moves the ack message before the messages will be transmitted. If it's the final message and the receiving node turns into DLM_CLOSED state another ack messages will being received and turning the receiving node into DLM_ESTABLISHED again. Fixes: 1696c75f1864 ("fs: dlm: add send ack threshold and append acks to msgs") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-10-12dlm: be sure we reset all nodes at forced shutdownAlexander Aring1-2/+8
In case we running in a force shutdown in either midcomms or lowcomms implementation we will make sure we reset all per midcomms node information. Fixes: 63e711b08160 ("fs: dlm: create midcomms nodes when configure") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-10-12dlm: fix remove member after close callAlexander Aring1-1/+12
The idea of commit 63e711b08160 ("fs: dlm: create midcomms nodes when configure") is to set the midcomms node lifetime when a node joins or leaves the cluster. Currently we can hit the following warning: [10844.611495] ------------[ cut here ]------------ [10844.615913] WARNING: CPU: 4 PID: 84304 at fs/dlm/midcomms.c:1263 dlm_midcomms_remove_member+0x13f/0x180 [dlm] or running in a state where we hit a midcomms node usage count in a negative value: [ 260.830782] node 2 users dec count -1 The first warning happens when the a specific node does not exists and it was probably removed but dlm_midcomms_close() which is called when a node leaves the cluster. The second kernel log message is probably in a case when dlm_midcomms_addr() is called when a joined the cluster but due fencing a node leaved the cluster without getting removed from the lockspace. If the node joins the cluster and it was removed from the cluster due fencing the first call is to remove the node from lockspaces triggered by the user space. In both cases if the node wasn't found or the user count is zero, we should ignore any additional midcomms handling of dlm_midcomms_remove_member(). Fixes: 63e711b08160 ("fs: dlm: create midcomms nodes when configure") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-10-12dlm: fix creating multiple node structuresAlexander Aring1-1/+9
This patch will lookup existing nodes instead of always creating them when dlm_midcomms_addr() is called. The idea is here to create midcomms nodes when user space getting informed that nodes joins the cluster. This is the case when dlm_midcomms_addr() is called, however it can be called multiple times by user space to add several address configurations to one node e.g. when using SCTP. Those multiple times need to be filtered out and we doing that by looking up if the node exists before. Due configfs entry it is safe that this function gets only called once at a time. Fixes: 63e711b08160 ("fs: dlm: create midcomms nodes when configure") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-10-12fs: dlm: Remove some useless memset()Christophe JAILLET1-5/+0
There is no need to clear the buffer used to build the file name. snprintf() already guarantees that it is NULL terminated and such a (useless) precaution was not done for the first string (i.e ls_debug_rsb_dentry) So, save a few LoC. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-10-12fs: dlm: Fix the size of a buffer in dlm_create_debug_file()Christophe JAILLET1-1/+2
8 is not the maximum size of the suffix used when creating debugfs files. Let the compiler compute the correct size, and only give a hint about the longest possible string that is used. When building with W=1, this fixes the following warnings: fs/dlm/debug_fs.c: In function ‘dlm_create_debug_file’: fs/dlm/debug_fs.c:1020:58: error: ‘snprintf’ output may be truncated before the last format character [-Werror=format-truncation=] 1020 | snprintf(name, DLM_LOCKSPACE_LEN + 8, "%s_waiters", ls->ls_name); | ^ fs/dlm/debug_fs.c:1020:9: note: ‘snprintf’ output between 9 and 73 bytes into a destination of size 72 1020 | snprintf(name, DLM_LOCKSPACE_LEN + 8, "%s_waiters", ls->ls_name); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/dlm/debug_fs.c:1031:50: error: ‘_queued_asts’ directive output may be truncated writing 12 bytes into a region of size between 8 and 72 [-Werror=format-truncation=] 1031 | snprintf(name, DLM_LOCKSPACE_LEN + 8, "%s_queued_asts", ls->ls_name); | ^~~~~~~~~~~~ fs/dlm/debug_fs.c:1031:9: note: ‘snprintf’ output between 13 and 77 bytes into a destination of size 72 1031 | snprintf(name, DLM_LOCKSPACE_LEN + 8, "%s_queued_asts", ls->ls_name); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fixes: 541adb0d4d10b ("fs: dlm: debugfs for queued callbacks") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-10-12fs: dlm: Simplify buffer size computation in dlm_create_debug_file()Christophe JAILLET1-5/+5
Use sizeof(name) instead of the equivalent, but hard coded, DLM_LOCKSPACE_LEN + 8. This is less verbose and more future proof. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-25dlm: fix plock lookup when using multiple lockspacesAlexander Aring1-3/+3
All posix lock ops, for all lockspaces (gfs2 file systems) are sent to userspace (dlm_controld) through a single misc device. The dlm_controld daemon reads the ops from the misc device and sends them to other cluster nodes using separate, per-lockspace cluster api communication channels. The ops for a single lockspace are ordered at this level, so that the results are received in the same sequence that the requests were sent. When the results are sent back to the kernel via the misc device, they are again funneled through the single misc device for all lockspaces. When the dlm code in the kernel processes the results from the misc device, these results will be returned in the same sequence that the requests were sent, on a per-lockspace basis. A recent change in this request/reply matching code missed the "per-lockspace" check (fsid comparison) when matching request and reply, so replies could be incorrectly matched to requests from other lockspaces. Cc: stable@vger.kernel.org Reported-by: Barry Marson <bmarson@redhat.com> Fixes: 57e2c2f2d94c ("fs: dlm: fix mismatch of plock results from userspace") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-10fs: dlm: don't use RCOM_NAMES for version detectionAlexander Aring1-8/+8
Currently RCOM_STATUS and RCOM_NAMES inclusive their replies are being used to determine the DLM version. The RCOM_NAMES messages are triggered in DLM recovery when calling dlm_recover_directory() only. At this time the DLM version need to be determined. I ran some tests and did not expirenced some issues. When the DLM version detection was developed probably I run once in a case of RCOM_NAMES and the version was not detected yet. However it seems to be not necessary. For backwards compatibility we still need to accept RCOM_NAMES messages which are not protected regarding the DLM message reliability layer aka stateless message. This patch changes that RCOM_NAMES we are sending out after this patch are not stateless anymore. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-10fs: dlm: create midcomms nodes when configureAlexander Aring3-179/+110
This patch puts the life of a midcomms node the same as a lowcomms connection. The lowcomms connection lifetime was changed by commit 6f0b0b5d7ae7 ("fs: dlm: remove dlm_node_addrs lookup list"). In the future the midcomms node instances can be merged with lowcomms connection structure as the lifetime is the same and states can be controlled over values or flags. Before midcomms nodes were generated during version detection. This is not necessary anymore when the nodes are created when the cluster manager configures DLM via configfs. When a midcomms node is created over configfs it well set DLM_VERSION_NOT_SET as version. This indicates that the version of the midcomms node is still unknown and need to be probed via certain rcom messages. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-10fs: dlm: constify receive bufferAlexander Aring13-86/+101
The dlm receive buffer should be never manipulated as DLM is the last instance of parsing layer. This patch constify the whole receive buffer so we are sure it never gets manipulated when it's being parsed. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-10fs: dlm: drop rxbuf manipulation in dlm_recover_master_copyAlexander Aring3-8/+17
Currently dlm_recover_master_copy() manipulates the receive buffer of an rcom lock message and modifies it on the fly so a later memcpy() to a new rcom message with the same message has those new values. This patch avoids manipulating the received rcom message by store the values for the new rcom message in paremter assigned with call by reference. Later when dlm_send_rcom_lock() constructs a new message and memcpy() the receive buffer those values will be set on the new constructed message. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-10fs: dlm: drop rxbuf manipulation in dlm_copy_master_namesAlexander Aring1-3/+2
This patch removes the manipulation of the receive buffer in case of an error and be sure the buffer is null terminated before an error messagea is printed out. Instead of manipulate the receive buffer we tell inside the format string the maximum length the string buffer is being read. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-10fs: dlm: get recovery sequence number as parameterAlexander Aring10-85/+99
This patch removes a read of the ls->ls_recover_seq uint64_t number in _create_rcom(). If the ls->ls_recover_seq is readed the ls_recover_lock need to held. However this number was always readed before when any rcom message is received and it's not necessary to read it again from a per lockspace variable to use it for the replying message. This patch will pass the sequence number as parameter so another read of ls->ls_recover_seq and holding the ls->ls_recover_lock is not required. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-10fs: dlm: cleanup lock orderAlexander Aring1-2/+2
This patch cleanups the lock order to hold at first the close_lock and then held the nodes_srcu read lock. Probably it will never be a problem as nodes_srcu is only a read lock preventing the node pointer getting freed. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-10fs: dlm: remove clear_members_cbAlexander Aring1-6/+1
This patch is just a small cleanup to directly call remove_remote_member() instead of going over clear_members_cb() which just calls remove_remote_member(). Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-10fs: dlm: add plock dev tracepointsAlexander Aring1-0/+6
I currently debug nfs plock handling and introduce those two tracepoints for getting more information about what is happening there if the user space reads plock operations from kernel and writing the result back. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-10fs: dlm: check on plock ops when exit dlmAlexander Aring1-0/+2
To be sure we don't have any issues that there are leftover plock ops in either send_list or recv_list we simple check if either one of the list are empty when we exit the dlm subsystem. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-10fs: dlm: debugfs for queued callbacksAlexander Aring2-1/+101
It was useful to debug an issue with the callback queue to check if any callbacks in any lkb are for some reason not processed by the callback workqueue. The mentioned issue was fixed by commit a034c1370ded ("fs: dlm: fix DLM_IFL_CB_PENDING gets overwritten"). If there are similar issue that looks like a ast callback was not processed, we can confirm now that it is not sitting to be processed by the callback workqueue anymore. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-10fs: dlm: remove unused processed_nodesAlexander Aring1-1/+0
The variable processed_nodes is not being used by commit 1696c75f1864 ("fs: dlm: add send ack threshold and append acks to msgs"). This patch removes the leftover of this commit. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-10fs: dlm: add missing spin_unlockAlexander Aring1-0/+1
This patch fixes commit dc52cd2eff4a ("fs: dlm: fix F_CANCELLK to cancel pending request") that we don't unlock the ops_lock in a rate case when a waiter cannot be found. This case can only happen when cancellation of plock operation was successful but no kernel waiter was being found. Fixes: dc52cd2eff4a ("fs: dlm: fix F_CANCELLK to cancel pending request") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-07-20fs: dlm: fix F_CANCELLK to cancel pending requestAlexander Aring1-13/+90
This patch fixes the current handling of F_CANCELLK by not just doing a unlock as we need to try to cancel a lock at first. A unlock makes sense on a non-blocking lock request but if it's a blocking lock request we need to cancel the request until it's not granted yet. This patch is fixing this behaviour by first try to cancel a lock request and if it's failed it's unlocking the lock which seems to be granted. Note: currently the nfs locking handling was disabled by commit 40595cdc93ed ("nfs: block notification on fs with its own ->lock"). However DLM was never being updated regarding to this change. Future patches will try to fix lockd lock requests for DLM. This patch is currently assuming the upstream DLM lockd handling is correct. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-07-20fs: dlm: allow to F_SETLKW getting interruptedAlexander Aring1-20/+36
This patch implements dlm plock F_SETLKW interruption feature. If a blocking posix lock request got interrupted in user space by a signal a cancellation request for a non granted lock request to the user space lock manager will be send. The user lock manager answers either with zero or a negative errno code. A errno of -ENOENT signals that there is currently no blocking lock request waiting to being granted. In case of -ENOENT it was probably to late to request a cancellation and the pending lock got granted. In any error case we will wait until the lock is being granted as cancellation failed, this causes also that in case of an older user lock manager returning -EINVAL we will wait as cancellation is not supported which should be fine. If a user requires this feature the user should update dlm user space to support lock request cancellation. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-07-20fs: dlm: remove twice newlineAlexander Aring1-2/+2
This patch removes a newline which log_print() already adds, also removes wrapped string that causes a checkpatch warning. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-29Merge tag 'dlm-6.5' of ↵Linus Torvalds13-222/+203
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "The dlm posix lock handling (for gfs2) has three notable changes: - Local pids returned from GETLK are no longer negated. A previous patch negating remote pids mistakenly changed local pids also. - SETLKW operations can now be interrupted only when the process is killed, and not from other signals. General interruption was resulting in previously acquired locks being cleared, not just the in-progress lock. Handling this correctly will require extending a cancel capability to user space (a future feature.) - If multiple threads are requesting posix locks (with SETLKW), fix incorrect matching of results to the requests. The dlm networking has several minor cleanups, and one notable change: - Avoid delaying ack messages for too long (used for message reliability), resulting in a backlog of un-acked messages. These could previously be delayed as a result of either too many or too few other messages being sent. Now an upper and lower threshold is used to determine when an ack should be sent" * tag 'dlm-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: fs: dlm: remove filter local comms on close fs: dlm: add send ack threshold and append acks to msgs fs: dlm: handle sequence numbers as atomic fs: dlm: handle lkb wait count as atomic_t fs: dlm: filter ourself midcomms calls fs: dlm: warn about messages from left nodes fs: dlm: move dlm_purge_lkb_callbacks to user module fs: dlm: cleanup STOP_IO bitflag set when stop io fs: dlm: don't check othercon twice fs: dlm: unregister memory at the very last fs: dlm: fix missing pending to false fs: dlm: clear pending bit when queue was empty fs: dlm: revert check required context while close fs: dlm: fix mismatch of plock results from userspace fs: dlm: make F_SETLK use unkillable wait_event fs: dlm: interrupt posix locks only when process is killed fs: dlm: fix cleanup pending ops when interrupted fs: dlm: return positive pid value for F_GETLK dlm: Replace all non-returning strlcpy with strscpy
2023-06-28Merge tag 'net-next-6.5' of ↵Linus Torvalds1-3/+7
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking changes from Jakub Kicinski: "WiFi 7 and sendpage changes are the biggest pieces of work for this release. The latter will definitely require fixes but I think that we got it to a reasonable point. Core: - Rework the sendpage & splice implementations Instead of feeding data into sockets page by page extend sendmsg handlers to support taking a reference on the data, controlled by a new flag called MSG_SPLICE_PAGES Rework the handling of unexpected-end-of-file to invoke an additional callback instead of trying to predict what the right combination of MORE/NOTLAST flags is Remove the MSG_SENDPAGE_NOTLAST flag completely - Implement SCM_PIDFD, a new type of CMSG type analogous to SCM_CREDENTIALS, but it contains pidfd instead of plain pid - Enable socket busy polling with CONFIG_RT - Improve reliability and efficiency of reporting for ref_tracker - Auto-generate a user space C library for various Netlink families Protocols: - Allow TCP to shrink the advertised window when necessary, prevent sk_rcvbuf auto-tuning from growing the window all the way up to tcp_rmem[2] - Use per-VMA locking for "page-flipping" TCP receive zerocopy - Prepare TCP for device-to-device data transfers, by making sure that payloads are always attached to skbs as page frags - Make the backoff time for the first N TCP SYN retransmissions linear. Exponential backoff is unnecessarily conservative - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO) - Avoid waking up applications using TLS sockets until we have a full record - Allow using kernel memory for protocol ioctl callbacks, paving the way to issuing ioctls over io_uring - Add nolocalbypass option to VxLAN, forcing packets to be fully encapsulated even if they are destined for a local IP address - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure in-kernel ECMP implementation (e.g. Open vSwitch) select the same link for all packets. Support L4 symmetric hashing in Open vSwitch - PPPoE: make number of hash bits configurable - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client (ipconfig) - Add layer 2 miss indication and filtering, allowing higher layers (e.g. ACL filters) to make forwarding decisions based on whether packet matched forwarding state in lower devices (bridge) - Support matching on Connectivity Fault Management (CFM) packets - Hide the "link becomes ready" IPv6 messages by demoting their printk level to debug - HSR: don't enable promiscuous mode if device offloads the proto - Support active scanning in IEEE 802.15.4 - Continue work on Multi-Link Operation for WiFi 7 BPF: - Add precision propagation for subprogs and callbacks. This allows maintaining verification efficiency when subprograms are used, or in fact passing the verifier at all for complex programs, especially those using open-coded iterators - Improve BPF's {g,s}setsockopt() length handling. Previously BPF assumed the length is always equal to the amount of written data. But some protos allow passing a NULL buffer to discover what the output buffer *should* be, without writing anything - Accept dynptr memory as memory arguments passed to helpers - Add routing table ID to bpf_fib_lookup BPF helper - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark maps as read-only) - Show target_{obj,btf}_id in tracing link fdinfo - Addition of several new kfuncs (most of the names are self-explanatory): - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(), bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size() and bpf_dynptr_clone(). - bpf_task_under_cgroup() - bpf_sock_destroy() - force closing sockets - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs Netfilter: - Relax set/map validation checks in nf_tables. Allow checking presence of an entry in a map without using the value - Increase ip_vs_conn_tab_bits range for 64BIT builds - Allow updating size of a set - Improve NAT tuple selection when connection is closing Driver API: - Integrate netdev with LED subsystem, to allow configuring HW "offloaded" blinking of LEDs based on link state and activity (i.e. packets coming in and out) - Support configuring rate selection pins of SFP modules - Factor Clause 73 auto-negotiation code out of the drivers, provide common helper routines - Add more fool-proof helpers for managing lifetime of MDIO devices associated with the PCS layer - Allow drivers to report advanced statistics related to Time Aware scheduler offload (taprio) - Allow opting out of VF statistics in link dump, to allow more VFs to fit into the message - Split devlink instance and devlink port operations New hardware / drivers: - Ethernet: - Synopsys EMAC4 IP support (stmmac) - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches - Marvell 88E6250 7 port switches - Microchip LAN8650/1 Rev.B0 PHYs - MediaTek MT7981/MT7988 built-in 1GE PHY driver - WiFi: - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps - Realtek RTL8723DS (SDIO variant) - Realtek RTL8851BE - CAN: - Fintek F81604 Drivers: - Ethernet NICs: - Intel (100G, ice): - support dynamic interrupt allocation - use meta data match instead of VF MAC addr on slow-path - nVidia/Mellanox: - extend link aggregation to handle 4, rather than just 2 ports - spawn sub-functions without any features by default - OcteonTX2: - support HTB (Tx scheduling/QoS) offload - make RSS hash generation configurable - support selecting Rx queue using TC filters - Wangxun (ngbe/txgbe): - add basic Tx/Rx packet offloads - add phylink support (SFP/PCS control) - Freescale/NXP (enetc): - report TAPRIO packet statistics - Solarflare/AMD: - support matching on IP ToS and UDP source port of outer header - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6 - add devlink dev info support for EF10 - Virtual NICs: - Microsoft vNIC: - size the Rx indirection table based on requested configuration - support VLAN tagging - Amazon vNIC: - try to reuse Rx buffers if not fully consumed, useful for ARM servers running with 16kB pages - Google vNIC: - support TCP segmentation of >64kB frames - Ethernet embedded switches: - Marvell (mv88e6xxx): - enable USXGMII (88E6191X) - Microchip: - lan966x: add support for Egress Stage 0 ACL engine - lan966x: support mapping packet priority to internal switch priority (based on PCP or DSCP) - Ethernet PHYs: - Broadcom PHYs: - support for Wake-on-LAN for BCM54210E/B50212E - report LPI counter - Microsemi PHYs: support RGMII delay configuration (VSC85xx) - Micrel PHYs: receive timestamp in the frame (LAN8841) - Realtek PHYs: support optional external PHY clock - Altera TSE PCS: merge the driver into Lynx PCS which it is a variant of - CAN: Kvaser PCIEcan: - support packet timestamping - WiFi: - Intel (iwlwifi): - major update for new firmware and Multi-Link Operation (MLO) - configuration rework to drop test devices and split the different families - support for segmented PNVM images and power tables - new vendor entries for PPAG (platform antenna gain) feature - Qualcomm 802.11ax (ath11k): - Multiple Basic Service Set Identifier (MBSSID) and Enhanced MBSSID Advertisement (EMA) support in AP mode - support factory test mode - RealTek (rtw89): - add RSSI based antenna diversity - support U-NII-4 channels on 5 GHz band - RealTek (rtl8xxxu): - AP mode support for 8188f - support USB RX aggregation for the newer chips" * tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits) net: scm: introduce and use scm_recv_unix helper af_unix: Skip SCM_PIDFD if scm->pid is NULL. net: lan743x: Simplify comparison netlink: Add __sock_i_ino() for __netlink_diag_dump(). net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses Revert "af_unix: Call scm_recv() only after scm_set_cred()." phylink: ReST-ify the phylink_pcs_neg_mode() kdoc libceph: Partially revert changes to support MSG_SPLICE_PAGES net: phy: mscc: fix packet loss due to RGMII delays net: mana: use vmalloc_array and vcalloc net: enetc: use vmalloc_array and vcalloc ionic: use vmalloc_array and vcalloc pds_core: use vmalloc_array and vcalloc gve: use vmalloc_array and vcalloc octeon_ep: use vmalloc_array and vcalloc net: usb: qmi_wwan: add u-blox 0x1312 composition perf trace: fix MSG_SPLICE_PAGES build error ipvlan: Fix return value of ipvlan_queue_xmit() netfilter: nf_tables: fix underflow in chain reference counter netfilter: nf_tables: unbind non-anonymous set if rule construction fails ...
2023-06-24dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpageDavid Howells1-3/+7
When transmitting data, call down a layer using a single sendmsg with MSG_SPLICE_PAGES to indicate that content should be spliced rather using sendpage. This allows ->sendpage() to be replaced by something that can handle multiple multipage folios in a single transaction. Signed-off-by: David Howells <dhowells@redhat.com> cc: Christine Caulfield <ccaulfie@redhat.com> cc: David Teigland <teigland@redhat.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: cluster-devel@redhat.com Link: https://lore.kernel.org/r/20230623225513.2732256-7-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-20fs: dlm: remove filter local comms on closeAlexander Aring1-2/+1
The current way how lowcomms is configured is due configfs entries. Each comms configfs entry will create a lowcomms connection. Even the local connection itself will be stored as a lowcomms connection, although most functionality for a local lowcomms connection struct is not necessary. Now in some scenarios we will see that dlm_controld reports a -EEXIST when configure a node via configfs: ... /sys/kernel/config/dlm/cluster/comms/1/addr: write failed: 17 -1 Doing a: cat /sys/kernel/config/dlm/cluster/comms/1/addr_list reported nothing. This was being seen on cluster with nodeid 1 and it's local configuration. To be sure the configfs entries are in sync with lowcomms connection structures we always call dlm_midcomms_close() to be sure the lowcomms connection gets removed when the configfs entry gets dropped. Before commit 07ee38674a0b ("fs: dlm: filter ourself midcomms calls") it was just doing this by accident and the filter by doing: if (nodeid == dlm_our_nodeid()) return 0; inside dlm_midcomms_close() was never been hit because drop_comm() sets local_comm to NULL and cause that dlm_our_nodeid() returns always the invalid nodeid 0. Fixes: 07ee38674a0b ("fs: dlm: filter ourself midcomms calls") Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: add send ack threshold and append acks to msgsAlexander Aring2-75/+31
This patch changes the time when we sending an ack back to tell the other side it can free some message because it is arrived on the receiver node, due random reconnects e.g. TCP resets this is handled as well on application layer to not let DLM run into a deadlock state. The current handling has the following problems: 1. We end in situations that we only send an ack back message of 16 bytes out and no other messages. Whereas DLM has logic to combine so much messages as it can in one send() socket call. This behaviour can be discovered by "trace-cmd start -e dlm_recv" and observing the ret field being 16 bytes. 2. When processing of DLM messages will never end because we receive a lot of messages, we will not send an ack back as it happens when the processing loop ends. This patch introduces a likely and unlikely threshold case. The likely case will send an ack back on a transmit path if the threshold is triggered of amount of processed upper layer protocol. This will solve issue 1 because it will be send when another normal DLM message will be sent. It solves issue 2 because it is not part of the processing loop. There is however a unlikely case, the unlikely case has a bigger threshold and will be triggered when we only receive messages and do not sent any message back. This case avoids that the sending node will keep a lot of message for a long time as we send sometimes ack backs to tell the sender to finally release messages. The atomic cmpxchg() is there to provide a atomically ack send with reset of the upper layer protocol delivery counter. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: handle sequence numbers as atomicAlexander Aring1-15/+25
Currently seq_next is only be read on the receive side which processed in an ordered way. The seq_send is being protected by locks. To being able to read the seq_next value on send side as well we convert it to an atomic_t value. The atomic_cmpxchg() is probably not necessary, however the atomic_inc() depends on a if coniditional and this should be handled in an atomic context. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: handle lkb wait count as atomic_tAlexander Aring2-19/+15
Currently the lkb_wait_count is locked by the rsb lock and it should be fine to handle lkb_wait_count as non atomic_t value. However for the overall process of reducing locking this patch converts it to an atomic_t value. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: filter ourself midcomms callsAlexander Aring4-20/+32
It makes no sense to call midcomms/lowcomms functionality for the local node as socket functionality is only required for remote nodes. This patch filters those calls in the upper layer of lockspace membership handling instead of doing it in midcomms/lowcomms layer as they should never be aware of local nodeid. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: warn about messages from left nodesAlexander Aring1-2/+2
This patch warns about messages which are received from nodes who already left the lockspace resource signaled by the cluster manager. Before commit 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") there was a synchronization issue with the socket lifetime and the cluster event of leaving a lockspace and other nodes did not stop of sending messages because the cluster manager has a pending message to leave the lockspace. The reliable session layer for dlm use sequence numbers to ensure dlm message were never being dropped. If this is not corrected synchronized we have a problem, this patch will use the filter case and turn it into a WARN_ON_ONCE() so we seeing such issue on the kernel log because it should never happen now. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: move dlm_purge_lkb_callbacks to user moduleAlexander Aring4-18/+19
This patch moves the dlm_purge_lkb_callbacks() function from ast to user dlm module as it is only a function being used by dlm user implementation. I got be hinted to hold specific locks regarding the callback handling for dlm_purge_lkb_callbacks() but it was false positive. It is confusing because ast dlm implementation uses a different locking behaviour as user locks uses as DLM handles kernel and user dlm locks differently. To avoid the confusing we move this function to dlm user implementation. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: cleanup STOP_IO bitflag set when stop ioAlexander Aring1-8/+4
There should no difference between setting the CF_IO_STOP flag before restore_callbacks() to do it before or afterwards. The restore_callbacks() will be sure that no callback is executed anymore when the bit wasn't set. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: don't check othercon twiceAlexander Aring1-2/+1
This patch removes an another check if con->othercon set inside the branch which already does that. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: unregister memory at the very lastAlexander Aring1-1/+1
The dlm modules midcomms, debugfs, lockspace, uses kmem caches. We ensure that the kmem caches getting deallocated after those modules exited. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: fix missing pending to falseAlexander Aring1-0/+1
This patch sets the process_dlm_messages_pending boolean to false when there was no message to process. It is a case which should not happen but if we are prepared to recover from this situation by setting pending boolean to false. Cc: stable@vger.kernel.org Fixes: dbb751ffab0b ("fs: dlm: parallelize lowcomms socket handling") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: clear pending bit when queue was emptyAlexander Aring1-3/+5
This patch clears the DLM_IFL_CB_PENDING_BIT flag which will be set when there is callback work queued when there was no callback to dequeue. It is a buggy case and should never happen, that's why there is a WARN_ON(). However if the case happens we are prepared to somehow recover from it. Cc: stable@vger.kernel.org Fixes: 61bed0baa4db ("fs: dlm: use a non-static queue for callbacks") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14fs: dlm: revert check required context while closeAlexander Aring3-16/+0
This patch reverts commit 2c3fa6ae4d52 ("dlm: check required context while close"). The function dlm_midcomms_close(), which will call later dlm_lowcomms_close(), is called when the cluster manager tells the node got fenced which means on midcomms/lowcomms layer to disconnect the node from the cluster communication. The node can rejoin the cluster later. This patch was ensuring no new message were able to be triggered when we are in the close() function context. This was done by checking if the lockspace has been stopped. However there is a missing check that we only need to check specific lockspaces where the fenced node is member of. This is currently complicated because there is no way to easily check if a node is part of a specific lockspace without stopping the recovery. For now we just revert this commit as it is just a check to finding possible leaks of stopping lockspaces before close() is called. Cc: stable@vger.kernel.org Fixes: 2c3fa6ae4d52 ("dlm: check required context while close") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-24fs: dlm: fix mismatch of plock results from userspaceAlexander Aring1-13/+45
When a waiting plock request (F_SETLKW) is sent to userspace for processing (dlm_controld), the result is returned at a later time. That result could be incorrectly matched to a different waiting request in cases where the owner field is the same (e.g. different threads in a process.) This is fixed by comparing all the properties in the request and reply. The results for non-waiting plock requests are now matched based on list order because the results are returned in the same order they were sent. Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-23fs: dlm: make F_SETLK use unkillable wait_eventAlexander Aring1-17/+21
While a non-waiting posix lock request (F_SETLK) is waiting for user space processing (in dlm_controld), wait for that processing to complete with an unkillable wait_event(). This makes F_SETLK behave the same way for F_RDLCK, F_WRLCK and F_UNLCK. F_SETLKW continues to use wait_event_killable(). Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22fs: dlm: interrupt posix locks only when process is killedAlexander Aring1-1/+1
If a posix lock request is waiting for a result from user space (dlm_controld), do not let it be interrupted unless the process is killed. This reverts commit a6b1533e9a57 ("dlm: make posix locks interruptible"). The problem with the interruptible change is that all locks were cleared on any signal interrupt. If a signal was received that did not terminate the process, the process could continue running after all its dlm posix locks had been cleared. A future patch will add cancelation to allow proper interruption. Cc: stable@vger.kernel.org Fixes: a6b1533e9a57 ("dlm: make posix locks interruptible") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22dlm: Replace all non-returning strlcpy with strscpyAzeem Shaikh1-2/+2
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230510221237.3509484-1-azeemshaikh38@gmail.com
2023-05-22fs: dlm: fix cleanup pending ops when interruptedAlexander Aring1-19/+6
Immediately clean up a posix lock request if it is interrupted while waiting for a result from user space (dlm_controld.) This largely reverts the recent commit b92a4e3f86b1 ("fs: dlm: change posix lock sigint handling"). That previous commit attempted to defer lock cleanup to the point in time when a result from user space arrived. The deferred approach was not reliable because some dlm plock ops may not receive replies. Cc: stable@vger.kernel.org Fixes: b92a4e3f86b1 ("fs: dlm: change posix lock sigint handling") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22fs: dlm: return positive pid value for F_GETLKAlexander Aring1-1/+3
The GETLK pid values have all been negated since commit 9d5b86ac13c5 ("fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks"). Revert this for local pids, and leave in place negative pids for remote owners. Cc: stable@vger.kernel.org Fixes: 9d5b86ac13c5 ("fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22dlm: Replace all non-returning strlcpy with strscpyAzeem Shaikh1-2/+2
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-04-26Merge tag 'net-next-6.4' of ↵Linus Torvalds1-3/+4
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the default value allows for better BIG TCP performances - Reduce compound page head access for zero-copy data transfers - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible - Threaded NAPI improvements, adding defer skb free support and unneeded softirq avoidance - Address dst_entry reference count scalability issues, via false sharing avoidance and optimize refcount tracking - Add lockless accesses annotation to sk_err[_soft] - Optimize again the skb struct layout - Extends the skb drop reasons to make it usable by multiple subsystems - Better const qualifier awareness for socket casts BPF: - Add skb and XDP typed dynptrs which allow BPF programs for more ergonomic and less brittle iteration through data and variable-sized accesses - Add a new BPF netfilter program type and minimal support to hook BPF programs to netfilter hooks such as prerouting or forward - Add more precise memory usage reporting for all BPF map types - Adds support for using {FOU,GUE} encap with an ipip device operating in collect_md mode and add a set of BPF kfuncs for controlling encap params - Allow BPF programs to detect at load time whether a particular kfunc exists or not, and also add support for this in light skeleton - Bigger batch of BPF verifier improvements to prepare for upcoming BPF open-coded iterators allowing for less restrictive looping capabilities - Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF programs to NULL-check before passing such pointers into kfunc - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in local storage maps - Enable RCU semantics for task BPF kptrs and allow referenced kptr tasks to be stored in BPF maps - Add support for refcounted local kptrs to the verifier for allowing shared ownership, useful for adding a node to both the BPF list and rbtree - Add BPF verifier support for ST instructions in convert_ctx_access() which will help new -mcpu=v4 clang flag to start emitting them - Add ARM32 USDT support to libbpf - Improve bpftool's visual program dump which produces the control flow graph in a DOT format by adding C source inline annotations Protocols: - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value indicates the provenance of the IP address - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition - Add the handshake upcall mechanism, allowing the user-space to implement generic TLS handshake on kernel's behalf - Bridge: support per-{Port, VLAN} neighbor suppression, increasing resilience to nodes failures - SCTP: add support for Fair Capacity and Weighted Fair Queueing schedulers - MPTCP: delay first subflow allocation up to its first usage. This will allow for later better LSM interaction - xfrm: Remove inner/outer modes from input/output path. These are not needed anymore - WiFi: - reduced neighbor report (RNR) handling for AP mode - HW timestamping support - support for randomized auth/deauth TA for PASN privacy - per-link debugfs for multi-link - TC offload support for mac80211 drivers - mac80211 mesh fast-xmit and fast-rx support - enable Wi-Fi 7 (EHT) mesh support Netfilter: - Add nf_tables 'brouting' support, to force a packet to be routed instead of being bridged - Update bridge netfilter and ovs conntrack helpers to handle IPv6 Jumbo packets properly, i.e. fetch the packet length from hop-by-hop extension header. This is needed for BIT TCP support - The iptables 32bit compat interface isn't compiled in by default anymore - Move ip(6)tables builtin icmp matches to the udptcp one. This has the advantage that icmp/icmpv6 match doesn't load the iptables/ip6tables modules anymore when iptables-nft is used - Extended netlink error report for netdevice in flowtables and netdev/chains. Allow for incrementally add/delete devices to netdev basechain. Allow to create netdev chain without device Driver API: - Remove redundant Device Control Error Reporting Enable, as PCI core has already error reporting enabled at enumeration time - Move Multicast DB netlink handlers to core, allowing devices other then bridge to use them - Allow the page_pool to directly recycle the pages from safely localized NAPI - Implement lockless TX queue stop/wake combo macros, allowing for further code de-duplication and sanitization - Add YNL support for user headers and struct attrs - Add partial YNL specification for devlink - Add partial YNL specification for ethtool - Add tc-mqprio and tc-taprio support for preemptible traffic classes - Add tx push buf len param to ethtool, specifies the maximum number of bytes of a transmitted packet a driver can push directly to the underlying device - Add basic LED support for switch/phy - Add NAPI documentation, stop relaying on external links - Convert dsa_master_ioctl() to netdev notifier. This is a preparatory work to make the hardware timestamping layer selectable by user space - Add transceiver support and improve the error messages for CAN-FD controllers New hardware / drivers: - Ethernet: - AMD/Pensando core device support - MediaTek MT7981 SoC - MediaTek MT7988 SoC - Broadcom BCM53134 embedded switch - Texas Instruments CPSW9G ethernet switch - Qualcomm EMAC3 DWMAC ethernet - StarFive JH7110 SoC - NXP CBTX ethernet PHY - WiFi: - Apple M1 Pro/Max devices - RealTek rtl8710bu/rtl8188gu - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset - Bluetooth: - Realtek RTL8821CS, RTL8851B, RTL8852BS - Mediatek MT7663, MT7922 - NXP w8997 - Actions Semi ATS2851 - QTI WCN6855 - Marvell 88W8997 - Can: - STMicroelectronics bxcan stm32f429 Drivers: - Ethernet NICs: - Intel (1G, icg): - add tracking and reporting of QBV config errors - add support for configuring max SDU for each Tx queue - Intel (100G, ice): - refactor mailbox overflow detection to support Scalable IOV - GNSS interface optimization - Intel (i40e): - support XDP multi-buffer - nVidia/Mellanox: - add the support for linux bridge multicast offload - enable TC offload for egress and engress MACVLAN over bond - add support for VxLAN GBP encap/decap flows offload - extend packet offload to fully support libreswan - support tunnel mode in mlx5 IPsec packet offload - extend XDP multi-buffer support - support MACsec VLAN offload - add support for dynamic msix vectors allocation - drop RX page_cache and fully use page_pool - implement thermal zone to report NIC temperature - Netronome/Corigine: - add support for multi-zone conntrack offload - Solarflare/Xilinx: - support offloading TC VLAN push/pop actions to the MAE - support TC decap rules - support unicast PTP - Other NICs: - Broadcom (bnxt): enforce software based freq adjustments only on shared PHC NIC - RealTek (r8169): refactor to addess ASPM issues during NAPI poll - Micrel (lan8841): add support for PTP_PF_PEROUT - Cadence (macb): enable PTP unicast - Engleder (tsnep): add XDP socket zero-copy support - virtio-net: implement exact header length guest feature - veth: add page_pool support for page recycling - vxlan: add MDB data path support - gve: add XDP support for GQI-QPL format - geneve: accept every ethertype - macvlan: allow some packets to bypass broadcast queue - mana: add support for jumbo frame - Ethernet high-speed switches: - Microchip (sparx5): Add support for TC flower templates - Ethernet embedded switches: - Broadcom (b54): - configure 6318 and 63268 RGMII ports - Marvell (mv88e6xxx): - faster C45 bus scan - Microchip: - lan966x: - add support for IS1 VCAP - better TX/RX from/to CPU performances - ksz9477: add ETS Qdisc support - ksz8: enhance static MAC table operations and error handling - sama7g5: add PTP capability - NXP (ocelot): - add support for external ports - add support for preemptible traffic classes - Texas Instruments: - add CPSWxG SGMII support for J7200 and J721E - Intel WiFi (iwlwifi): - preparation for Wi-Fi 7 EHT and multi-link support - EHT (Wi-Fi 7) sniffer support - hardware timestamping support for some devices/firwmares - TX beacon protection on newer hardware - Qualcomm 802.11ax WiFi (ath11k): - MU-MIMO parameters support - ack signal support for management packets - RealTek WiFi (rtw88): - SDIO bus support - better support for some SDIO devices (e.g. MAC address from efuse) - RealTek WiFi (rtw89): - HW scan support for 8852b - better support for 6 GHz scanning - support for various newer firmware APIs - framework firmware backwards compatibility - MediaTek WiFi (mt76): - P2P support - mesh A-MSDU support - EHT (Wi-Fi 7) support - coredump support" * tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits) net: phy: hide the PHYLIB_LEDS knob net: phy: marvell-88x2222: remove unnecessary (void*) conversions tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. net: amd: Fix link leak when verifying config failed net: phy: marvell: Fix inconsistent indenting in led_blink_set lan966x: Don't use xdp_frame when action is XDP_TX tsnep: Add XDP socket zero-copy TX support tsnep: Add XDP socket zero-copy RX support tsnep: Move skb receive action to separate function tsnep: Add functions for queue enable/disable tsnep: Rework TX/RX queue initialization tsnep: Replace modulo operation with mask net: phy: dp83867: Add led_brightness_set support net: phy: Fix reading LED reg property drivers: nfc: nfcsim: remove return value check of `dev_dir` net: phy: dp83867: Remove unnecessary (void*) conversions net: ethtool: coalesce: try to make user settings stick twice net: mana: Check if netdev/napi_alloc_frag returns single page net: mana: Rename mana_refill_rxoob and remove some empty lines net: veth: add page_pool stats ...
2023-04-21fs: dlm: stop unnecessarily filling zero ms_extra bytesAlexander Aring1-1/+1
Commit 7175e131ebba ("fs: dlm: fix invalid derefence of sb_lvbptr") fixes an issue when the lkb->lkb_lvbptr set to an dangled pointer and an followed memcpy() would fail. It was fixed by an additional check of DLM_LKF_VALBLK flag. The mentioned commit forgot to add an additional check if DLM_LKF_VALBLK is set for the additional amount of LVB data allocated in a dlm message. This patch is changing the message allocation to check additionally if DLM_LKF_VALBLK is set otherwise a dangled lkb->lkb_lvbptr pointer would allocated zero LVB message data which not gets filled with actual data. This patch is however only a cleanup to reduce the amount of zero bytes transmitted over network as receive_lvb() will only evaluates message LVB data if DLM_LKF_VALBLK is set. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-17net: annotate lockless accesses to sk->sk_err_softEric Dumazet1-3/+4
This field can be read/written without lock synchronization. tcp and dccp have been handled in different patches. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-06fs: dlm: switch lkb_sbflags to atomic opsAlexander Aring2-12/+38
This patch moves lkb_sbflags handling to atomic bits ops. This should prepare for a possible manipulating of lkb_sbflags flags at the same time by concurrent execution. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-06fs: dlm: rsb hash table flag value to atomic opsAlexander Aring2-6/+6
This patch moves the rsb hash table handling to atomic flag operations. The flag operations for DLM_RTF_SHRINK are protected by ls->ls_rsbtbl[b].lock. However we switch to atomic ops if new possible flags will be used in a different way and don't assume such lock dependencies. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-06fs: dlm: move internal flags to atomic opsAlexander Aring6-83/+90
This patch will move the lkb_flags value to the recently introduced lkb_iflags value. For lkb_iflags we use atomic bit operations because some flags like DLM_IFL_CB_PENDING are used while non rsb lock is held to avoid issues with other flag manipulations which might run at the same time we switch to atomic bit operations. Snapshot the bit values to an uint32_t value is only used for debugging/logging use cases and don't need to be 100% correct. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-06fs: dlm: change dflags to use atomic bitsAlexander Aring7-23/+63
Currently manipulating lkb_dflags assumes to held the rsb lock assigned to the lkb. This is held by dlm message processing after certain time to lookup the right rsb from the received lkb message id. For user space locks flags, which is currently the only use case for lkb_dflags, flags are also being set during dlm character device handling without holding the rsb lock. To minimize the risk that bit operations are getting corrupted we switch to atomic bit operations. This patch will also introduce helpers to snapshot atomic bit values in an non atomic way. There might be still issues with the flag handling e.g. running in case of manipulating bit ops and snapshot them at the same time, but this patch minimize them and will start to use atomic bit operations. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-06fs: dlm: store lkb distributed flags into own valueAlexander Aring7-32/+27
This patch stores lkb distributed flags value in an separate value instead of sharing internal and distributed flags in lkb->lkb_flags value. This has the advantage to not mask/write back flag values in receive_flags() functionality. The dlm debug_fs does not provide the distributed flags anymore, those can be added in future. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-06fs: dlm: remove DLM_IFL_LOCAL_MS flagAlexander Aring2-33/+33
The DLM_IFL_LOCAL_MS flag is an internal non shared flag but used in m_flags of dlm messages. It is not shared because it is only used for local messaging. Instead using DLM_IFL_LOCAL_MS in dlm messages we pass a parameter around to signal local messaging or not. This patch is adding the local parameter to signal local messaging. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-06fs: dlm: rename stub to local message flagAlexander Aring3-59/+59
This patch renames DLM_IFL_STUB_MS to DLM_IFL_LOCAL_MS flag. The DLM_IFL_STUB_MS flag is somewhat misnamed, it means the dlm message is used for local message transfer only. It is used by recovery to resolve lock states if a node got fenced. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-06fs: dlm: remove deprecated code partsAlexander Aring12-469/+1
This patch removes code parts which was declared deprecated by commit 6b0afc0cc3e9 ("fs: dlm: don't use deprecated timeout features by default"). This contains the following dlm functionality: - start a cancel of a dlm request did not complete after certain timeout: The current way how dlm cancellation works and interfering with other dlm requests triggered by the user can end in an overlapping and returning in -EBUSY. The most user don't handle this case and are unaware that DLM can return such errno in such situation. Due the timeout the user are mostly unaware when this happens. - start a netlink warning messages for user space if dlm requests did not complete after certain timeout: This feature was never being built in the only known dlm user space side. As we are to remove the timeout cancellation feature we can directly remove this feature as well. There might be the possibility to bring the timeout cancellation feature back. However the current way of handling the -EBUSY case which is only a software limitation and not a hardware limitation should be changed. We minimize the current code base in DLM cancellation feature to not have to deal with those existing features while solving the DLM cancellation feature in general. UAPI define DLM_LSFL_TIMEWARN is commented as deprecated and reserved value. We should avoid at first to give it a new meaning but let possible users still compile by keeping this define. In far future we can give this flag a new meaning. The same for the DLM_LKF_TIMEOUT lock request flag. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-06DLM: increase socket backlog to avoid hangs with 16 nodesEdwin Török1-1/+1
On a 16 node virtual cluster with e1000 NICs joining the 12th node prints SYN flood warnings for the DLM port: Dec 21 01:46:41 localhost kernel: [ 2146.516664] TCP: request_sock_TCP: Possible SYN flooding on port 21064. Sending cookies. Check SNMP counters. And then joining a DLM lockspace hangs: ``` Dec 21 01:49:00 localhost kernel: [ 2285.780913] INFO: task xapi-clusterd:17638 blocked for more than 120 seconds. Dec 21 01:49:00 localhost kernel: [ 2285.786476] Not tainted 4.4.0+10 #1 Dec 21 01:49:00 localhost kernel: [ 2285.789043] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 21 01:49:00 localhost kernel: [ 2285.794611] xapi-clusterd D ffff88001930bc58 0 17638 1 0x00000000 Dec 21 01:49:00 localhost kernel: [ 2285.794615] ffff88001930bc58 ffff880025593800 ffff880022433800 ffff88001930c000 Dec 21 01:49:00 localhost kernel: [ 2285.794617] ffff88000ef4a660 ffff88000ef4a658 ffff880022433800 ffff88000ef4a000 Dec 21 01:49:00 localhost kernel: [ 2285.794619] ffff88001930bc70 ffffffff8159f6b4 7fffffffffffffff ffff88001930bd10 Dec 21 01:49:00 localhost kernel: [ 2285.794644] [<ffffffff811570fe>] ? printk+0x4d/0x4f Dec 21 01:49:00 localhost kernel: [ 2285.794647] [<ffffffff810b1741>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 Dec 21 01:49:00 localhost kernel: [ 2285.794649] [<ffffffff815a085d>] wait_for_completion+0x9d/0x110 Dec 21 01:49:00 localhost kernel: [ 2285.794653] [<ffffffff810979e0>] ? wake_up_q+0x80/0x80 Dec 21 01:49:00 localhost kernel: [ 2285.794661] [<ffffffffa03fa4b8>] dlm_new_lockspace+0x908/0xac0 [dlm] Dec 21 01:49:00 localhost kernel: [ 2285.794665] [<ffffffff810aaa60>] ? prepare_to_wait_event+0x100/0x100 Dec 21 01:49:00 localhost kernel: [ 2285.794670] [<ffffffffa0402e37>] device_write+0x497/0x6b0 [dlm] Dec 21 01:49:00 localhost kernel: [ 2285.794673] [<ffffffff811834f0>] ? handle_mm_fault+0x7f0/0x13b0 Dec 21 01:49:00 localhost kernel: [ 2285.794677] [<ffffffff811b4438>] __vfs_write+0x28/0xd0 Dec 21 01:49:00 localhost kernel: [ 2285.794679] [<ffffffff811b4b7f>] ? rw_verify_area+0x6f/0xd0 Dec 21 01:49:00 localhost kernel: [ 2285.794681] [<ffffffff811b4dc1>] vfs_write+0xb1/0x190 Dec 21 01:49:00 localhost kernel: [ 2285.794686] [<ffffffff8105ffc2>] ? __do_page_fault+0x302/0x420 Dec 21 01:49:00 localhost kernel: [ 2285.794688] [<ffffffff811b5986>] SyS_write+0x46/0xa0 Dec 21 01:49:00 localhost kernel: [ 2285.794690] [<ffffffff815a31ae>] entry_SYSCALL_64_fastpath+0x12/0x71 ``` The previous limit of 5 seems like an arbitrary number, that doesn't match any known DLM cluster size upper bound limit. Signed-off-by: Edwin Török <edvin.torok@citrix.com> Cc: Christine Caulfield <ccaulfie@redhat.com> Cc: David Teigland <teigland@redhat.com> Cc: cluster-devel@redhat.com Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-06fs: dlm: add unbound flag to dlm_io workqueueAlexander Aring1-2/+2
This patch will add the WQ_UNBOUND flag to the lowcomms dlm_io workqueue which handles socket io handling to send and receive dlm messages. The amount of sockets will be 2 for a 3 node cluster. Each socket has two different workers for doing send and receive work by calling socket API functionality. Each worker will do their task in order to send dlm messages in a ordered stream based socket communication. On receive side the receive buffer will be queued up for an ordered dlm_process workqueue to parse received dlm messages. The parsing need to be done currently in an ordered synchronized way because the dlm message processing is not being made to parse parallel. After explaining all those workqueue behaviours in lowcomms, the dlm_io workqueue is only being used for socket handling. Each socket handling has 2 workers (send and receive). In a 3 cluster node we will end up with 4 workers. Without the WQ_UNBOUND flag the workers are tight to a CPU and can never switch, this could be an advantage because local CPU execution. However with dlm_locktorture testcase I expierenced not all workers are always in use and my assumption is that some workers are bound to the same CPU. We should always send or receive when we are ready to do so, one reason why we disable nigel algorithm on sockets. We should be safe to do the socket io handling on any CPU which can be switched during runtime. There is no assumption that the worker stays on the same CPU. There is no need to respect any workqueue concurrency model that each worker can only run on one CPU. Lowcomms queue_work() mechanism has an higher level flag to be sure that it can't schedule work if the previous worker did not signal it to keep ordered socket handling. Therefore this patch sets the WQ_UNBOUND flag to allow workers being executed by any available CPU. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-06fs: dlm: fix DLM_IFL_CB_PENDING gets overwrittenAlexander Aring3-7/+9
This patch introduce a new internal flag per lkb value to handle internal flags which are handled not on wire. The current lkb internal flags stored as lkb->lkb_flags are split in upper and lower bits, the lower bits are used to share internal flags over wire for other cluster wide lkb copies on other nodes. In commit 61bed0baa4db ("fs: dlm: use a non-static queue for callbacks") we introduced a new internal flag for pending callbacks for the dlm callback queue. This flag is protected by the lkb->lkb_cb_lock lock. This patch overlooked that on dlm receive path and the mentioned upper and lower bits, that dlm will read the flags, mask it and write it back. As example receive_flags() in fs/dlm/lock.c. This flag manipulation is not done atomically and is not protected by lkb->lkb_cb_lock. This has unknown side effects of the current callback handling. In future we should move to set/clear/test bit functionality and avoid read, mask and writing back flag values. In later patches we will move the upper parts to the new introduced internal lkb flags which are not shared between other cluster nodes to the new non shared internal flag field to avoid similar issues. Cc: stable@vger.kernel.org Fixes: 61bed0baa4db ("fs: dlm: use a non-static queue for callbacks") Reported-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-02-24Merge tag 'driver-core-6.3-rc1' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the large set of driver core changes for 6.3-rc1. There's a lot of changes this development cycle, most of the work falls into two different categories: - fw_devlink fixes and updates. This has gone through numerous review cycles and lots of review and testing by lots of different devices. Hopefully all should be good now, and Saravana will be keeping a watch for any potential regression on odd embedded systems. - driver core changes to work to make struct bus_type able to be moved into read-only memory (i.e. const) The recent work with Rust has pointed out a number of areas in the driver core where we are passing around and working with structures that really do not have to be dynamic at all, and they should be able to be read-only making things safer overall. This is the contuation of that work (started last release with kobject changes) in moving struct bus_type to be constant. We didn't quite make it for this release, but the remaining patches will be finished up for the release after this one, but the groundwork has been laid for this effort. Other than that we have in here: - debugfs memory leak fixes in some subsystems - error path cleanups and fixes for some never-able-to-be-hit codepaths. - cacheinfo rework and fixes - Other tiny fixes, full details are in the shortlog All of these have been in linux-next for a while with no reported problems" [ Geert Uytterhoeven points out that that last sentence isn't true, and that there's a pending report that has a fix that is queued up - Linus ] * tag 'driver-core-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (124 commits) debugfs: drop inline constant formatting for ERR_PTR(-ERROR) OPP: fix error checking in opp_migrate_dentry() debugfs: update comment of debugfs_rename() i3c: fix device.h kernel-doc warnings dma-mapping: no need to pass a bus_type into get_arch_dma_ops() driver core: class: move EXPORT_SYMBOL_GPL() lines to the correct place Revert "driver core: add error handling for devtmpfs_create_node()" Revert "devtmpfs: add debug info to handle()" Revert "devtmpfs: remove return value of devtmpfs_delete_node()" driver core: cpu: don't hand-override the uevent bus_type callback. devtmpfs: remove return value of devtmpfs_delete_node() devtmpfs: add debug info to handle() driver core: add error handling for devtmpfs_create_node() driver core: bus: update my copyright notice driver core: bus: add bus_get_dev_root() function driver core: bus: constify bus_unregister() driver core: bus: constify some internal functions driver core: bus: constify bus_get_kset() driver core: bus: constify bus_register/unregister_notifier() driver core: remove private pointer from struct bus_type ...
2023-02-21Merge tag 'net-next-6.3' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ...
2023-02-20Merge tag 'dlm-6.3' of ↵Linus Torvalds6-97/+136
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This fixes some races in the lowcomms startup and shutdown code that were found by targeted stress testing that quickly and repeatedly joins and leaves lockspaces" * tag 'dlm-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: fs: dlm: remove unnecessary waker_up() calls fs: dlm: move state change into else branch fs: dlm: remove newline in log_print fs: dlm: reduce the shutdown timeout to 5 secs fs: dlm: make dlm sequence id more robust fs: dlm: wait until all midcomms nodes detect version fs: dlm: ignore unexpected non dlm opts msgs fs: dlm: bring back previous shutdown handling fs: dlm: send FIN ack back in right cases fs: dlm: move sending fin message into state change handling fs: dlm: don't set stop rx flag after node reset fs: dlm: fix race setting stop tx flag fs: dlm: be sure to call dlm_send_queue_flush() fs: dlm: fix use after free in midcomms commit fs: dlm: start midcomms before scand fs/dlm: Remove "select SRCU" fs: dlm: fix return value check in dlm_memory_init()
2023-01-27kobject: kset_uevent_ops: make uevent() callback take a const *Greg Kroah-Hartman1-2/+2
The uevent() callback in struct kset_uevent_ops does not modify the kobject passed into it, so make the pointer const to enforce this restriction. When doing so, fix up all existing uevent() callbacks to have the correct signature to preserve the build. Cc: Christine Caulfield <ccaulfie@redhat.com> Cc: David Teigland <teigland@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230111113018.459199-17-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-23fs: dlm: remove unnecessary waker_up() callsAlexander Aring1-2/+0
The wake_up() is already handled inside of midcomms_node_reset() when switching the state to CLOSED state. So there is not need to call it after midcomms_node_reset() again. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: move state change into else branchAlexander Aring1-3/+4
Currently we can switch at first into DLM_CLOSE_WAIT state and then do another state change if a condition is true. Instead of doing two state changes we handle the other state change inside an else branch of this condition. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: remove newline in log_printAlexander Aring1-4/+4
There is an API difference between log_print() and other printk()s to put a newline or not. This one was introduced by mistake because log_print() adds a newline. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: reduce the shutdown timeout to 5 secsAlexander Aring1-2/+2
When a shutdown is stuck, time out after 5 seconds instead of 3 minutes. After this timeout we try a forced shutdown. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: make dlm sequence id more robustAlexander Aring1-1/+1
When joining a new lockspace, use a random number to initialize a sequence number used in messages. This makes it easier to detect sequence number mismatches in message replies during tests that repeatedly join and leave a lockspace. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: wait until all midcomms nodes detect versionAlexander Aring3-0/+27
The current dlm version detection is very complex due to backwards compatablilty with earlier dlm protocol versions. It takes some time to detect if a peer node has a specific DLM version. If it's not detected, we just cut the socket connection. There could be cases where the local node has not detected the version yet, but the peer node has. In these cases, we are trying to shutdown the dlm connection with a FIN/ACK message exchange to be sure the other peer is ready to shutdown the connection on dlm application level. However this mechanism is only available on DLM protocol version 3.2 and we need to be sure the DLM version is detected before. To make it more robust we introduce a a "best effort" wait to wait for the version detection before shutdown the dlm connection. This need to be done before the kthread recoverd for recovery handling is stopped, because recovery handling will trigger enough messages to have a version detection going on. It is a corner case which was detected by modprobe dlm_locktroture module and rmmod dlm_locktorture module directly afterwards (in a looping behaviour). In practice probably nobody would leave a lockspace immediately after joining it. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: ignore unexpected non dlm opts msgsAlexander Aring1-10/+2
This patch ignores unexpected RCOM_NAMES/RCOM_STATUS messages. To be backwards compatible, those messages are not part of the new reliable DLM OPTS encapsulation header, and have their own retransmit handling using sequence number matching When we get unexpected non dlm opts messages, we should allow them and let RCOM message handling filter them out using sequence numbers. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: bring back previous shutdown handlingAlexander Aring2-34/+63
This patch mostly reverts commit 4f567acb0b86 ("fs: dlm: remove socket shutdown handling"). There can be situations where the dlm midcomms nodes hash and lowcomms connection hash are not equal, but we need to guarantee that the lowcomms are all closed on a last release of a dlm lockspace, when a shutdown is invoked. This patch guarantees that we always close all sockets managed by the lowcomms connection hash, and calls shutdown for the last message sent. This ensures we don't cut the socket, which could cause the peer to get a connection reset. In future we should try to merge the midcomms/lowcomms hashes into one hash and not handle both in separate hashes. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: send FIN ack back in right casesAlexander Aring1-4/+5
This patch moves to send a ack back for receiving a FIN message only when we are in valid states. In other cases and there might be a sender waiting for a ack we just let it timeout at the senders time and hopefully all other cleanups will remove the FIN message on their sending queue. As an example we should never send out an ACK being in LAST_ACK state or we cannot assume a working socket communication when we are in CLOSED state. Cc: stable@vger.kernel.org Fixes: 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: move sending fin message into state change handlingAlexander Aring1-24/+9
This patch moves the send fin handling, which should appear in a specific state change, into the state change handling while the per node state_lock is held. I experienced issues with other messages because we changed the state and a fin message was sent out in a different state. Cc: stable@vger.kernel.org Fixes: 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: don't set stop rx flag after node resetAlexander Aring1-2/+1
Similar to the stop tx flag, the rx flag should warn about a dlm message being received at DLM_FIN state change, when we are assuming no other dlm application messages. If we receive a FIN message and we are in the state DLM_FIN_WAIT2 we call midcomms_node_reset() which puts the midcomms node into DLM_CLOSED state. Afterwards we should not set the DLM_NODE_FLAG_STOP_RX flag any more. This patch changes the setting DLM_NODE_FLAG_STOP_RX in those state changes when we receive a FIN message and we assume there will be no other dlm application messages received until we hit DLM_CLOSED state. Cc: stable@vger.kernel.org Fixes: 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: fix race setting stop tx flagAlexander Aring1-1/+1
This patch sets the stop tx flag before we commit the dlm message. This flag will report about unexpected transmissions after we send the DLM_FIN message out, which should be the last message sent. When we commit the dlm fin message, it could be that we already got an ack back and the CLOSED state change already happened. We should not set this flag when we are in CLOSED state. To avoid this race we simply set the tx flag before the state change can be in progress by moving it before dlm_midcomms_commit_mhandle(). Cc: stable@vger.kernel.org Fixes: 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: be sure to call dlm_send_queue_flush()Alexander Aring1-0/+1
If we release a midcomms node structure, there should be nothing left inside the dlm midcomms send queue. However, sometimes this is not true because I believe some DLM_FIN message was not acked... if we run into a shutdown timeout, then we should be sure there is no pending send dlm message inside this queue when releasing midcomms node structure. Cc: stable@vger.kernel.org Fixes: 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: fix use after free in midcomms commitAlexander Aring1-0/+7
While working on processing dlm message in softirq context I experienced the following KASAN use-after-free warning: [ 151.760477] ================================================================== [ 151.761803] BUG: KASAN: use-after-free in dlm_midcomms_commit_mhandle+0x19d/0x4b0 [ 151.763414] Read of size 4 at addr ffff88811a980c60 by task lock_torture/1347 [ 151.765284] CPU: 7 PID: 1347 Comm: lock_torture Not tainted 6.1.0-rc4+ #2828 [ 151.766778] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-3.module+el8.7.0+16134+e5908aa2 04/01/2014 [ 151.768726] Call Trace: [ 151.769277] <TASK> [ 151.769748] dump_stack_lvl+0x5b/0x86 [ 151.770556] print_report+0x180/0x4c8 [ 151.771378] ? kasan_complete_mode_report_info+0x7c/0x1e0 [ 151.772241] ? dlm_midcomms_commit_mhandle+0x19d/0x4b0 [ 151.773069] kasan_report+0x93/0x1a0 [ 151.773668] ? dlm_midcomms_commit_mhandle+0x19d/0x4b0 [ 151.774514] __asan_load4+0x7e/0xa0 [ 151.775089] dlm_midcomms_commit_mhandle+0x19d/0x4b0 [ 151.775890] ? create_message.isra.29.constprop.64+0x57/0xc0 [ 151.776770] send_common+0x19f/0x1b0 [ 151.777342] ? remove_from_waiters+0x60/0x60 [ 151.778017] ? lock_downgrade+0x410/0x410 [ 151.778648] ? __this_cpu_preempt_check+0x13/0x20 [ 151.779421] ? rcu_lockdep_current_cpu_online+0x88/0xc0 [ 151.780292] _convert_lock+0x46/0x150 [ 151.780893] convert_lock+0x7b/0xc0 [ 151.781459] dlm_lock+0x3ac/0x580 [ 151.781993] ? 0xffffffffc0540000 [ 151.782522] ? torture_stop+0x120/0x120 [dlm_locktorture] [ 151.783379] ? dlm_scan_rsbs+0xa70/0xa70 [ 151.784003] ? preempt_count_sub+0xd6/0x130 [ 151.784661] ? is_module_address+0x47/0x70 [ 151.785309] ? torture_stop+0x120/0x120 [dlm_locktorture] [ 151.786166] ? 0xffffffffc0540000 [ 151.786693] ? lockdep_init_map_type+0xc3/0x360 [ 151.787414] ? 0xffffffffc0540000 [ 151.787947] torture_dlm_lock_sync.isra.3+0xe9/0x150 [dlm_locktorture] [ 151.789004] ? torture_stop+0x120/0x120 [dlm_locktorture] [ 151.789858] ? 0xffffffffc0540000 [ 151.790392] ? lock_torture_cleanup+0x20/0x20 [dlm_locktorture] [ 151.791347] ? delay_tsc+0x94/0xc0 [ 151.791898] torture_ex_iter+0xc3/0xea [dlm_locktorture] [ 151.792735] ? torture_start+0x30/0x30 [dlm_locktorture] [ 151.793606] lock_torture+0x177/0x270 [dlm_locktorture] [ 151.794448] ? torture_dlm_lock_sync.isra.3+0x150/0x150 [dlm_locktorture] [ 151.795539] ? lock_torture_stats+0x80/0x80 [dlm_locktorture] [ 151.796476] ? do_raw_spin_lock+0x11e/0x1e0 [ 151.797152] ? mark_held_locks+0x34/0xb0 [ 151.797784] ? _raw_spin_unlock_irqrestore+0x30/0x70 [ 151.798581] ? __kthread_parkme+0x79/0x110 [ 151.799246] ? trace_preempt_on+0x2a/0xf0 [ 151.799902] ? __kthread_parkme+0x79/0x110 [ 151.800579] ? preempt_count_sub+0xd6/0x130 [ 151.801271] ? __kasan_check_read+0x11/0x20 [ 151.801963] ? __kthread_parkme+0xec/0x110 [ 151.802630] ? lock_torture_stats+0x80/0x80 [dlm_locktorture] [ 151.803569] kthread+0x192/0x1d0 [ 151.804104] ? kthread_complete_and_exit+0x30/0x30 [ 151.804881] ret_from_fork+0x1f/0x30 [ 151.805480] </TASK> [ 151.806111] Allocated by task 1347: [ 151.806681] kasan_save_stack+0x26/0x50 [ 151.807308] kasan_set_track+0x25/0x30 [ 151.807920] kasan_save_alloc_info+0x1e/0x30 [ 151.808609] __kasan_slab_alloc+0x63/0x80 [ 151.809263] kmem_cache_alloc+0x1ad/0x830 [ 151.809916] dlm_allocate_mhandle+0x17/0x20 [ 151.810590] dlm_midcomms_get_mhandle+0x96/0x260 [ 151.811344] _create_message+0x95/0x180 [ 151.811994] create_message.isra.29.constprop.64+0x57/0xc0 [ 151.812880] send_common+0x129/0x1b0 [ 151.813467] _convert_lock+0x46/0x150 [ 151.814074] convert_lock+0x7b/0xc0 [ 151.814648] dlm_lock+0x3ac/0x580 [ 151.815199] torture_dlm_lock_sync.isra.3+0xe9/0x150 [dlm_locktorture] [ 151.816258] torture_ex_iter+0xc3/0xea [dlm_locktorture] [ 151.817129] lock_torture+0x177/0x270 [dlm_locktorture] [ 151.817986] kthread+0x192/0x1d0 [ 151.818518] ret_from_fork+0x1f/0x30 [ 151.819369] Freed by task 1336: [ 151.819890] kasan_save_stack+0x26/0x50 [ 151.820514] kasan_set_track+0x25/0x30 [ 151.821128] kasan_save_free_info+0x2e/0x50 [ 151.821812] __kasan_slab_free+0x107/0x1a0 [ 151.822483] kmem_cache_free+0x204/0x5e0 [ 151.823152] dlm_free_mhandle+0x18/0x20 [ 151.823781] dlm_mhandle_release+0x2e/0x40 [ 151.824454] rcu_core+0x583/0x1330 [ 151.825047] rcu_core_si+0xe/0x20 [ 151.825594] __do_softirq+0xf4/0x5c2 [ 151.826450] Last potentially related work creation: [ 151.827238] kasan_save_stack+0x26/0x50 [ 151.827870] __kasan_record_aux_stack+0xa2/0xc0 [ 151.828609] kasan_record_aux_stack_noalloc+0xb/0x20 [ 151.829415] call_rcu+0x4c/0x760 [ 151.829954] dlm_mhandle_delete+0x97/0xb0 [ 151.830718] dlm_process_incoming_buffer+0x2fc/0xb30 [ 151.831524] process_dlm_messages+0x16e/0x470 [ 151.832245] process_one_work+0x505/0xa10 [ 151.832905] worker_thread+0x67/0x650 [ 151.833507] kthread+0x192/0x1d0 [ 151.834046] ret_from_fork+0x1f/0x30 [ 151.834900] The buggy address belongs to the object at ffff88811a980c30 which belongs to the cache dlm_mhandle of size 88 [ 151.836894] The buggy address is located 48 bytes inside of 88-byte region [ffff88811a980c30, ffff88811a980c88) [ 151.839007] The buggy address belongs to the physical page: [ 151.839904] page:0000000076cf5d62 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11a980 [ 151.841378] flags: 0x8000000000000200(slab|zone=2) [ 151.842141] raw: 8000000000000200 0000000000000000 dead000000000122 ffff8881089b43c0 [ 151.843401] raw: 0000000000000000 0000000000220022 00000001ffffffff 0000000000000000 [ 151.844640] page dumped because: kasan: bad access detected [ 151.845822] Memory state around the buggy address: [ 151.846602] ffff88811a980b00: fb fb fb fb fc fc fc fc fa fb fb fb fb fb fb fb [ 151.847761] ffff88811a980b80: fb fb fb fc fc fc fc fa fb fb fb fb fb fb fb fb [ 151.848921] >ffff88811a980c00: fb fb fc fc fc fc fa fb fb fb fb fb fb fb fb fb [ 151.850076] ^ [ 151.851085] ffff88811a980c80: fb fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb [ 151.852269] ffff88811a980d00: fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb fc [ 151.853428] ================================================================== [ 151.855618] Disabling lock debugging due to kernel taint It is accessing a mhandle in dlm_midcomms_commit_mhandle() and the mhandle was freed by a call_rcu() call in dlm_process_incoming_buffer(), dlm_mhandle_delete(). It looks like it was freed because an ack of this message was received. There is a short race between committing the dlm message to be transmitted and getting an ack back. If the ack is faster than returning from dlm_midcomms_commit_msg_3_2(), then we run into a use-after free because we still need to reference the mhandle when calling srcu_read_unlock(). To avoid that, we don't allow that mhandle to be freed between dlm_midcomms_commit_msg_3_2() and srcu_read_unlock() by using rcu read lock. We can do that because mhandle is protected by rcu handling. Cc: stable@vger.kernel.org Fixes: 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23fs: dlm: start midcomms before scandAlexander Aring1-8/+8
The scand kthread can send dlm messages out, especially dlm remove messages to free memory for unused rsb on other nodes. To send out dlm messages, midcomms must be initialized. This patch moves the midcomms start before scand is started. Cc: stable@vger.kernel.org Fixes: e7fd41792fc0 ("[DLM] The core of the DLM for GFS2/CLVM") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-23net/sock: Introduce trace_sk_data_ready()Peilin Ye1-0/+5
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready() callback implementations. For example: <...> iperf-609 [002] ..... 70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable iperf-609 [002] ..... 70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable <...> Suggested-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-11filelock: move file locking definitions to separate header fileJeff Layton1-0/+1
The file locking definitions have lived in fs.h since the dawn of time, but they are only used by a small subset of the source files that include it. Move the file locking definitions to a new header file, and add the appropriate #include directives to the source files that need them. By doing this we trim down fs.h a bit and limit the amount of rebuilding that has to be done when we make changes to the file locking APIs. Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Steve French <stfrench@microsoft.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2023-01-05fs/dlm: Remove "select SRCU"Paul E. McKenney1-1/+0
Now that the SRCU Kconfig option is unconditionally selected, there is no longer any point in selecting it. Therefore, remove the "select SRCU" Kconfig statements. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: David Teigland <teigland@redhat.com>
2023-01-05fs: dlm: fix return value check in dlm_memory_init()Yang Yingliang1-1/+1
It should check 'cb_cache', after calling kmem_cache_create("dlm_cb"). Fixes: 61bed0baa4db ("fs: dlm: use a non-static queue for callbacks") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-12-19Treewide: Stop corrupting socket's task_fragBenjamin Coddington1-0/+2
Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the GFP_NOIO flag on sk_allocation which the networking system uses to decide when it is safe to use current->task_frag. The results of this are unexpected corruption in task_frag when SUNRPC is involved in memory reclaim. The corruption can be seen in crashes, but the root cause is often difficult to ascertain as a crashing machine's stack trace will have no evidence of being near NFS or SUNRPC code. I believe this problem to be much more pervasive than reports to the community may indicate. Fix this by having kernel users of sockets that may corrupt task_frag due to reclaim set sk_use_task_frag = false. Preemptively correcting this situation for users that still set sk_allocation allows them to convert to memalloc_nofs_save/restore without the same unexpected corruptions that are sure to follow, unlikely to show up in testing, and difficult to bisect. CC: Philipp Reisner <philipp.reisner@linbit.com> CC: Lars Ellenberg <lars.ellenberg@linbit.com> CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com> CC: Jens Axboe <axboe@kernel.dk> CC: Josef Bacik <josef@toxicpanda.com> CC: Keith Busch <kbusch@kernel.org> CC: Christoph Hellwig <hch@lst.de> CC: Sagi Grimberg <sagi@grimberg.me> CC: Lee Duncan <lduncan@suse.com> CC: Chris Leech <cleech@redhat.com> CC: Mike Christie <michael.christie@oracle.com> CC: "James E.J. Bottomley" <jejb@linux.ibm.com> CC: "Martin K. Petersen" <martin.petersen@oracle.com> CC: Valentina Manea <valentina.manea.m@gmail.com> CC: Shuah Khan <shuah@kernel.org> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: David Howells <dhowells@redhat.com> CC: Marc Dionne <marc.dionne@auristor.com> CC: Steve French <sfrench@samba.org> CC: Christine Caulfield <ccaulfie@redhat.com> CC: David Teigland <teigland@redhat.com> CC: Mark Fasheh <mark@fasheh.com> CC: Joel Becker <jlbec@evilplan.org> CC: Joseph Qi <joseph.qi@linux.alibaba.com> CC: Eric Van Hensbergen <ericvh@gmail.com> CC: Latchesar Ionkov <lucho@ionkov.net> CC: Dominique Martinet <asmadeus@codewreck.org> CC: Ilya Dryomov <idryomov@gmail.com> CC: Xiubo Li <xiubli@redhat.com> CC: Chuck Lever <chuck.lever@oracle.com> CC: Jeff Layton <jlayton@kernel.org> CC: Trond Myklebust <trond.myklebust@hammerspace.com> CC: Anna Schumaker <anna@kernel.org> CC: Steffen Klassert <steffen.klassert@secunet.com> CC: Herbert Xu <herbert@gondor.apana.org.au> Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-22fs: dlm: fix building without lockdepAlexander Aring1-1/+5
This patch uses assert_spin_locked() instead of lockdep_is_held() where it's available to use because lockdep_is_held() is only available if CONFIG_LOCKDEP is set. In other cases like lockdep_sock_is_held() we surround it by a CONFIG_LOCKDEP idef. Fixes: dbb751ffab0b ("fs: dlm: parallelize lowcomms socket handling") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: parallelize lowcomms socket handlingAlexander Aring3-484/+586
This patch is rework of lowcomms handling, the main goal was here to handle recvmsg() and sendpage() to run parallel. Parallel in two senses: 1. per connection and 2. that recvmsg()/sendpage() doesn't block each other. Currently recvmsg()/sendpage() cannot run parallel because two workqueues "dlm_recv" and "dlm_send" are ordered workqueues. That means only one work item can be executed. The amount of queue items will be increased about the amount of nodes being inside the cluster. The current two workqueues for sending and receiving can also block each other if the same connection is executed at the same time in dlm_recv and dlm_send workqueue because a per connection mutex for the socket handling. To make it more parallel we introduce one "dlm_io" workqueue which is not an ordered workqueue, the amount of workers are not limited. Due per connection flags SEND/RECV pending we schedule workers ordered per connection and per send and receive task. To get rid of the mutex blocking same workers to do socket handling we switched to a semaphore which handles socket operations as read lock and sock releases as write operations, to prevent sock_release() being called while the socket is being used. There might be more optimization removing the semaphore and replacing it with other synchronization mechanism, however due other circumstances e.g. othercon behaviour it seems complicated to doing this change. I added comments to remove the othercon handling and moving to a different synchronization mechanism as this is done. We need to do that to the next dlm major version upgrade because it is not backwards compatible with the current connect mechanism. The processing of dlm messages need to be still handled by a ordered workqueue. An dlm_process ordered workqueue was introduced which gets filled by the receive worker. This is probably the next bottleneck of DLM but the application can't currently parse dlm messages parallel. A comment was introduced to lift the workqueue context of dlm processing in a non-sleepable softirq to get messages processing done fast. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: don't init error valueAlexander Aring1-1/+1
This patch removes a init of an error value to -EINVAL which is not necessary. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: use saved sk_error_report()Alexander Aring1-5/+1
This patch changes the handling of calling the original sk_error_report() by not putting it on the stack and calling it later. If the listen_sock.sk_error_report() is NULL in this moment it indicates a bug in our implementation. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: use sock2con without checking nullAlexander Aring1-13/+4
This patch removes null checks on private data for sockets. If we have a null dereference there we having a bug in our implementation that such callback occurs in this state. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: remove dlm_node_addrs lookup listAlexander Aring1-154/+136
This patch merges the dlm_node_addrs lookup list to the connection structure. It is a per node mapping to some configuration setup by configfs. We don't need two lookup structures. The connection hash has now a lifetime like the dlm_node_addrs entries. Means we add only new entries when configure cluster and not while new connections are coming in, remove connection when a node got fenced and cleanup all connection when the dlm exits. It should work the same and even will show more issues because we don't try to somehow keep those two data structures in sync with the current cluster configuration. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: don't put dlm_local_addrs on heapAlexander Aring1-26/+12
This patch removes to allocate the dlm_local_addr[] pointers on the heap. Instead we directly store the type of "struct sockaddr_storage". This removes function deinit_local() because it was freeing memory only. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: cleanup listen sock handlingAlexander Aring1-34/+17
This patch removes save_listen_callbacks() and add_listen_sock() as they are only used once in lowcomms functionality. For shutdown lowcomms it's not necessary to whole flush the workqueues to synchronize with restoring the old sk_data_ready() callback. Only the listen con receive work need to be cancelled. For each individual node shutdown we should be sure that last ack was been transmitted which is done by flushing per connection swork worker. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: remove socket shutdown handlingAlexander Aring3-107/+27
Since commit 489d8e559c65 ("fs: dlm: add reliable connection if reconnect") we have functionality like TCP offers for half-closed sockets on dlm application protocol layer. This feature is required because the cluster manager events about leaving resource memberships can be locally already occurred but other cluster nodes having a pending leaving membership over the cluster manager protocol happening. In this time the local dlm node already shutdown it's connection and don't transmit anymore any new dlm messages, but however it still needs to be able to accept dlm messages because the pending leave membership request of the cluster manager protocol which the dlm kernel implementation has no control about it. We have this functionality on the application for two reasons, the main reason is that SCTP does not support such functionality on socket layer. But we can do it inside application layer. Another small issue is that this feature is broken in the TCP world because some NAT devices does not implement such functionality correctly. This is the same reason why the reliable connection session layer in DLM exists. We give up on middle devices in the networking which sends e.g. TCP resets out. In DLM we cannot have any message dropping and we ensure it over a session layer that it can't happen. Back to the half-closed grace shutdown handling. It's not necessary anymore to do it on socket layer (which is only support for TCP sockets) because we do it on application layer. This patch removes this handling, if there are still issues then we have a problem on the application layer for such handling. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: use listen sock as dlm running indicatorAlexander Aring3-15/+10
This patch will switch from dlm_allow_conn to check if dlm lowcomms is running or not to if we actually have a listen socket set or not. The list socket will be set and unset in lowcomms start and shutdown functionality. To synchronize with data_ready() callback we will set the socket callback to NULL while socket lock is held. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: use list_first_entry_or_nullAlexander Aring1-6/+3
Instead of check on list_empty() we can do the same with list_first_entry_or_null() and return NULL if the returned value is NULL. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: remove twice INIT_WORKAlexander Aring1-1/+0
This patch removed a twice INIT_WORK() functionality. We already doing this inside of dlm_lowcomms_init() functionality which is called only once dlm is loaded. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: add midcomms init/start functionsAlexander Aring6-12/+37
This patch introduces leftovers of init, start, stop and exit functionality. The dlm application layer should always call the midcomms layer which getting aware of such event and redirect it to the lowcomms layer. Some functionality which is currently handled inside the start functionality of midcomms and lowcomms should be handled in the init functionality as it only need to be initialized once when dlm is loaded. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: add dst nodeid for msg tracingAlexander Aring1-4/+6
In DLM when we send a dlm message it is easy to add the lock resource name, but additional lookup is required when to trace the receive message side. The idea here is to move the lookup work to the user by using a lookup to find the right send message with recv message. As note DLM can't drop any message which is guaranteed by a special session layer. For doing the lookup a 3 tupel is required as an unique identification which is dst nodeid, src nodeid and sequence number. This patch adds the destination nodeid to the dlm message trace points. The source nodeid is given by the h_nodeid field inside the header. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: rename DLM_IFL_NEED_SCHED to DLM_IFL_CB_PENDINGAlexander Aring3-8/+6
This patch renames DLM_IFL_NEED_SCHED to DLM_IFL_CB_PENDING because CB_PENDING is a proper name to describe this flag. This flag is set when callback enqueue will return DLM_ENQUEUE_CALLBACK_NEED_SCHED because the callback worker need to be queued. The flag tells that callbacks are currently pending to be called and will be unset if the callback work for the specific lkb is done. The term need schedule is part of this time but a proper name is to say that there are some callbacks pending to being called. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: ast do WARN_ON_ONCE() on hotpathAlexander Aring2-7/+7
This patch changes the ast hotpath functionality in very unlikely cases that we do WARN_ON_ONCE() instead of WARN_ON() to not spamming the console output if we run into states that it would occur over and over again. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: drop lkb ref in bug caseAlexander Aring1-1/+2
This patch will drop the lkb reference in an very unlikely case which should in practice not happened. However if it happens we cleanup the reference just in case. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-21fs: dlm: avoid false-positive checker warningAlexander Aring1-1/+2
This patch avoid the false-positive checker warning about writing 112 bytes into a 88 bytes field "e->request", see: [ 54.891560] dlm: csmb1: dlm_recover_directory 23 out 2 messages [ 54.990542] ------------[ cut here ]------------ [ 54.991012] memcpy: detected field-spanning write (size 112) of single field "&e->request" at fs/dlm/requestqueue.c:47 (size 88) [ 54.992150] WARNING: CPU: 0 PID: 297 at fs/dlm/requestqueue.c:47 dlm_add_requestqueue+0x177/0x180 [ 54.993002] CPU: 0 PID: 297 Comm: kworker/u4:3 Not tainted 6.1.0-rc5-00008-ge01d50cbd6ee #248 [ 54.993878] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014 [ 54.994718] Workqueue: dlm_recv process_recv_sockets [ 54.995230] RIP: 0010:dlm_add_requestqueue+0x177/0x180 [ 54.995731] Code: e7 01 0f 85 3b ff ff ff b9 58 00 00 00 48 c7 c2 c0 41 74 82 4c 89 ee 48 c7 c7 20 42 74 82 c6 05 8b 8d 30 02 01 e8 51 07 be 00 <0f> 0b e9 12 ff ff ff 66 90 0f 1f 44 00 00 41 57 48 8d 87 10 08 00 [ 54.997483] RSP: 0018:ffffc90000b1fbe8 EFLAGS: 00010282 [ 54.997990] RAX: 0000000000000000 RBX: ffff888024fc3d00 RCX: 0000000000000000 [ 54.998667] RDX: 0000000000000001 RSI: ffffffff81155014 RDI: fffff52000163f73 [ 54.999342] RBP: ffff88800dbac000 R08: 0000000000000001 R09: ffffc90000b1fa5f [ 54.999997] R10: fffff52000163f4b R11: 203a7970636d656d R12: ffff88800cfb0018 [ 55.000673] R13: 0000000000000070 R14: ffff888024fc3d18 R15: 0000000000000000 [ 55.001344] FS: 0000000000000000(0000) GS:ffff88806d600000(0000) knlGS:0000000000000000 [ 55.002078] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 55.002603] CR2: 00007f35d4f0b9a0 CR3: 0000000025495002 CR4: 0000000000770ef0 [ 55.003258] PKRU: 55555554 [ 55.003514] Call Trace: [ 55.003756] <TASK> [ 55.003953] dlm_receive_buffer+0x1c0/0x200 [ 55.004348] dlm_process_incoming_buffer+0x46d/0x780 [ 55.004786] ? kernel_recvmsg+0x8b/0xc0 [ 55.005150] receive_from_sock.isra.0+0x168/0x420 [ 55.005582] ? process_listen_recv_socket+0x10/0x10 [ 55.006018] ? finish_task_switch.isra.0+0xe0/0x400 [ 55.006469] ? __switch_to+0x2fe/0x6a0 [ 55.006808] ? read_word_at_a_time+0xe/0x20 [ 55.007197] ? strscpy+0x146/0x190 [ 55.007505] process_one_work+0x3d0/0x6b0 [ 55.007863] worker_thread+0x8d/0x620 [ 55.008209] ? __kthread_parkme+0xd8/0xf0 [ 55.008565] ? process_one_work+0x6b0/0x6b0 [ 55.008937] kthread+0x171/0x1a0 [ 55.009251] ? kthread_exit+0x60/0x60 [ 55.009582] ret_from_fork+0x1f/0x30 [ 55.009903] </TASK> [ 55.010120] ---[ end trace 0000000000000000 ]--- [ 55.025783] dlm: csmb1: dlm_recover 5 generation 3 done: 201 ms [ 55.026466] gfs2: fsid=smbcluster:csmb1.0: recover generation 3 done It seems the checker is unable to detect the additional length bytes which was allocated additionally for the flexible array in struct dlm_message. To solve it we split the memcpy() into copy for the 88 bytes struct and another memcpy() for the flexible array m_extra field. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: use WARN_ON_ONCE() instead of WARN_ON()Alexander Aring1-9/+9
To not get the console spammed about WARN_ON() of invalid states in the dlm midcomms hot path handling we switch to WARN_ON_ONCE() to get it only once that there might be an issue with the midcomms state handling. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: fix log of lowcomms vs midcommsAlexander Aring1-1/+1
This patch will fix a small issue when printing out that dlm_midcomms_start() failed to start and it was printing out that the dlm subcomponent lowcomms was failed but lowcomms is behind the midcomms layer. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: catch dlm_add_member() errorAlexander Aring1-1/+4
This patch will catch a possible dlm_add_member() and delivers it to the dlm recovery handling. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: relax sending to allow receivingAlexander Aring1-5/+10
This patch drops additionally the sock_mutex when there is a sending message burst. Since we have acknowledge handling we free sending buffers only when we receive an ack back, but if we are stuck in send_to_sock() looping because dlm sends a lot of messages and we never leave the loop the sending buffer fill up very quickly. We can't receive during this iteration because the sock_mutex is held. This patch will unlock the sock_mutex so it should be possible to receive messages when a burst of sending messages happens. This will allow to free up memory because acks which are already received can be processed. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: remove ls_remove_wait waitqueueAlexander Aring3-61/+2
This patch removes the ls_remove_wait waitqueue handling. The current handling tries to wait before a lookup is send out for a identically resource name which is going to be removed. Hereby the remove message should be send out before the new lookup message. The reason is that after a lookup request and response will actually use the specific remote rsb. A followed remove message would delete the rsb on the remote side but it's still being used. To reach a similar behaviour we simple send the remove message out while the rsb lookup lock is held and the rsb is removed from the toss list. Other find_rsb() calls would never have the change to get a rsb back to live while a remove message will be send out (without holding the lock). This behaviour requires a non-sleepable context which should be provided now and might be the reason why it was not implemented so in the first place. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: allow different allocation context per _create_messageAlexander Aring4-16/+23
This patch allows to give the use control about the allocation context based on a per message basis. Currently all messages forced to be created under GFP_NOFS context. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: use a non-static queue for callbacksAlexander Aring9-217/+222
This patch will introducde a queue implementation for callbacks by using the Linux lists. The current callback queue handling is implemented by a static limit of 6 entries, see DLM_CALLBACKS_SIZE. The sequence number inside the callback structure was used to see if the entries inside the static entry is valid or not. We don't need any sequence numbers anymore with a dynamic datastructure with grows and shrinks during runtime to offer such functionality. We assume that every callback will be delivered to the DLM user if once queued. Therefore the callback flag DLM_CB_SKIP was dropped and the check for skipping bast was moved before worker handling and not skip while the callback worker executes. This will reduce unnecessary queues of the callback worker. All last callback saves are pointers now and don't need to copied over. There is a reference counter for callback structures which will care about to free the callback structures at the right time if they are not referenced anymore. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: move last cast bast time to function callAlexander Aring1-6/+4
This patch moves the debugging information of the last cast and bast time when calling the last and bast function call. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: use spin lock instead of mutexAlexander Aring3-6/+6
There is no need to use a mutex in those hot path sections. We change it to spin lock to serve callbacks more faster by not allowing schedule. The locked sections will not be locked for a long time. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: convert ls_cb_mutex mutex to spinlockAlexander Aring3-8/+8
This patch converts the ls_cb_mutex mutex to a spinlock, there is no sleepable context when this lock is held. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: use list_first_entry marcoAlexander Aring1-1/+1
Instead of using list_entry() this patch moves to using the list_first_entry() macro. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: let dlm_add_cb queue work after resume onlyAlexander Aring1-2/+2
We should allow dlm_add_cb() to call queue_work() only after the recovery queued pending for delayed lkbs. This patch will move the switch LSFL_CB_DELAY after the delayed lkb work was processed. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fd: dlm: trace send/recv of dlm message and rcomAlexander Aring4-17/+56
This patch adds tracepoints for send and recv cases of dlm messages and dlm rcom messages. In case of send and dlm message we add the dlm rsb resource name this dlm messages belongs to. This has the advantage to follow dlm messages on a per lock basis. In case of recv message the resource name can be extracted by follow the send message sequence number. The dlm message DLM_MSG_PURGE doesn't belong to a lock request and will not set the resource name in a dlm_message trace. The same for all rcom messages. There is additional handling required for this debugging functionality which is tried to be small as possible. Also the midcomms layer gets aware of lock resource names, for now this is required to make a connection between sequence number and lock resource names. It is for debugging purpose only. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: use packet in dlm_mhandleAlexander Aring1-3/+3
To allow more than just dereferencing the inner header we directly point to the inner dlm packet which allows us to dereference the header, rcom or message structure. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: remove send repeat remove handlingAlexander Aring1-74/+0
This patch removes the send repeat remove handling. This handling is there to repeatingly DLM_MSG_REMOVE messages in cases the dlm stack thinks it was not received at the first time. In cases of message drops this functionality is necessary, but since the DLM midcomms layer guarantees there are no messages drops between cluster nodes this feature became not strict necessary anymore. Due message delays/processing it could be that two send_repeat_remove() are sent out while the other should be still on it's way. We remove the repeat remove handling because we are sure that the message cannot be dropped due communication errors. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: retry accept() until -EAGAIN or error returnsAlexander Aring1-1/+5
This patch fixes a race if we get two times an socket data ready event while the listen connection worker is queued. Currently it will be served only once but we need to do it (in this case twice) until we hit -EAGAIN which tells us there is no pending accept going on. This patch wraps an do while loop until we receive a return value which is different than 0 as it was done before commit d11ccd451b65 ("fs: dlm: listen socket out of connection hash"). Cc: stable@vger.kernel.org Fixes: d11ccd451b65 ("fs: dlm: listen socket out of connection hash") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08fs: dlm: fix sock release if listen failsAlexander Aring1-2/+1
This patch fixes a double sock_release() call when the listen() is called for the dlm lowcomms listen socket. The caller of dlm_listen_for_all should never care about releasing the socket if dlm_listen_for_all() fails, it's done now only once if listen() fails. Cc: stable@vger.kernel.org Fixes: 2dc6b1158c28 ("fs: dlm: introduce generic listen") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-11-08dlm: replace one-element array with fixed size arrayPaulo Miguel Almeida2-2/+2
One-element arrays are deprecated. So, replace one-element array with fixed size array member in struct dlm_ls, and refactor the rest of the code, accordingly. Link: https://github.com/KSPP/linux/issues/79 Link: https://github.com/KSPP/linux/issues/228 Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101836 Link: https://lore.kernel.org/lkml/Y0W5jkiXUkpNl4ap@mail.google.com/ Signed-off-by: Paulo Miguel Almeida <paulo.miguel.almeida.rodenas@gmail.com> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Teigland <teigland@redhat.com>
2022-10-04Merge tag 'net-next-6.1' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Introduce and use a single page frag cache for allocating small skb heads, clawing back the 10-20% performance regression in UDP flood test from previous fixes. - Run packets which already went thru HW coalescing thru SW GRO. This significantly improves TCP segment coalescing and simplifies deployments as different workloads benefit from HW or SW GRO. - Shrink the size of the base zero-copy send structure. - Move TCP init under a new slow / sleepable version of DO_ONCE(). BPF: - Add BPF-specific, any-context-safe memory allocator. - Add helpers/kfuncs for PKCS#7 signature verification from BPF programs. - Define a new map type and related helpers for user space -> kernel communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF). - Allow targeting BPF iterators to loop through resources of one task/thread. - Add ability to call selected destructive functions. Expose crash_kexec() to allow BPF to trigger a kernel dump. Use CAP_SYS_BOOT check on the loading process to judge permissions. - Enable BPF to collect custom hierarchical cgroup stats efficiently by integrating with the rstat framework. - Support struct arguments for trampoline based programs. Only structs with size <= 16B and x86 are supported. - Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping sockets (instead of just TCP and UDP sockets). - Add a helper for accessing CLOCK_TAI for time sensitive network related programs. - Support accessing network tunnel metadata's flags. - Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open. - Add support for writing to Netfilter's nf_conn:mark. Protocols: - WiFi: more Extremely High Throughput (EHT) and Multi-Link Operation (MLO) work (802.11be, WiFi 7). - vsock: improve support for SO_RCVLOWAT. - SMC: support SO_REUSEPORT. - Netlink: define and document how to use netlink in a "modern" way. Support reporting missing attributes via extended ACK. - IPSec: support collect metadata mode for xfrm interfaces. - TCPv6: send consistent autoflowlabel in SYN_RECV state and RST packets. - TCP: introduce optional per-netns connection hash table to allow better isolation between namespaces (opt-in, at the cost of memory and cache pressure). - MPTCP: support TCP_FASTOPEN_CONNECT. - Add NEXT-C-SID support in Segment Routing (SRv6) End behavior. - Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets. - Open vSwitch: - Allow specifying ifindex of new interfaces. - Allow conntrack and metering in non-initial user namespace. - TLS: support the Korean ARIA-GCM crypto algorithm. - Remove DECnet support. Driver API: - Allow selecting the conduit interface used by each port in DSA switches, at runtime. - Ethernet Power Sourcing Equipment and Power Device support. - Add tc-taprio support for queueMaxSDU parameter, i.e. setting per traffic class max frame size for time-based packet schedules. - Support PHY rate matching - adapting between differing host-side and link-side speeds. - Introduce QUSGMII PHY mode and 1000BASE-KX interface mode. - Validate OF (device tree) nodes for DSA shared ports; make phylink-related properties mandatory on DSA and CPU ports. Enforcing more uniformity should allow transitioning to phylink. - Require that flash component name used during update matches one of the components for which version is reported by info_get(). - Remove "weight" argument from driver-facing NAPI API as much as possible. It's one of those magic knobs which seemed like a good idea at the time but is too indirect to use in practice. - Support offload of TLS connections with 256 bit keys. New hardware / drivers: - Ethernet: - Microchip KSZ9896 6-port Gigabit Ethernet Switch - Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs - Analog Devices ADIN1110 and ADIN2111 industrial single pair Ethernet (10BASE-T1L) MAC+PHY. - Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP). - Ethernet SFPs / modules: - RollBall / Hilink / Turris 10G copper SFPs - HALNy GPON module - WiFi: - CYW43439 SDIO chipset (brcmfmac) - CYW89459 PCIe chipset (brcmfmac) - BCM4378 on Apple platforms (brcmfmac) Drivers: - CAN: - gs_usb: HW timestamp support - Ethernet PHYs: - lan8814: cable diagnostics - Ethernet NICs: - Intel (100G): - implement control of FCS/CRC stripping - port splitting via devlink - L2TPv3 filtering offload - nVidia/Mellanox: - tunnel offload for sub-functions - MACSec offload, w/ Extended packet number and replay window offload - significantly restructure, and optimize the AF_XDP support, align the behavior with other vendors - Huawei: - configuring DSCP map for traffic class selection - querying standard FEC statistics - querying SerDes lane number via ethtool - Marvell/Cavium: - egress priority flow control - MACSec offload - AMD/SolarFlare: - PTP over IPv6 and raw Ethernet - small / embedded: - ax88772: convert to phylink (to support SFP cages) - altera: tse: convert to phylink - ftgmac100: support fixed link - enetc: standard Ethtool counters - macb: ZynqMP SGMII dynamic configuration support - tsnep: support multi-queue and use page pool - lan743x: Rx IP & TCP checksum offload - igc: add xdp frags support to ndo_xdp_xmit - Ethernet high-speed switches: - Marvell (prestera): - support SPAN port features (traffic mirroring) - nexthop object offloading - Microchip (sparx5): - multicast forwarding offload - QoS queuing offload (tc-mqprio, tc-tbf, tc-ets) - Ethernet embedded switches: - Marvell (mv88e6xxx): - support RGMII cmode - NXP (felix): - standardized ethtool counters - Microchip (lan966x): - QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets) - traffic policing and mirroring - link aggregation / bonding offload - QUSGMII PHY mode support - Qualcomm 802.11ax WiFi (ath11k): - cold boot calibration support on WCN6750 - support to connect to a non-transmit MBSSID AP profile - enable remain-on-channel support on WCN6750 - Wake-on-WLAN support for WCN6750 - support to provide transmit power from firmware via nl80211 - support to get power save duration for each client - spectral scan support for 160 MHz - MediaTek WiFi (mt76): - WiFi-to-Ethernet bridging offload for MT7986 chips - RealTek WiFi (rtw89): - P2P support" * tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1864 commits) eth: pse: add missing static inlines once: rename _SLOW to _SLEEPABLE net: pse-pd: add regulator based PSE driver dt-bindings: net: pse-dt: add bindings for regulator based PoDL PSE controller ethtool: add interface to interact with Ethernet Power Equipment net: mdiobus: search for PSE nodes by parsing PHY nodes. net: mdiobus: fwnode_mdiobus_register_phy() rework error handling net: add framework to support Ethernet PSE and PDs devices dt-bindings: net: phy: add PoDL PSE property net: marvell: prestera: Propagate nh state from hw to kernel net: marvell: prestera: Add neighbour cache accounting net: marvell: prestera: add stub handler neighbour events net: marvell: prestera: Add heplers to interact with fib_notifier_info net: marvell: prestera: Add length macros for prestera_ip_addr net: marvell: prestera: add delayed wq and flush wq on deinit net: marvell: prestera: Add strict cleanup of fib arbiter net: marvell: prestera: Add cleanup of allocated fib_nodes net: marvell: prestera: Add router nexthops ABI eth: octeon: fix build after netif_napi_add() changes net/mlx5: E-Switch, Return EBUSY if can't get mode lock ...
2022-09-26fs: dlm: fix possible use after free if tracingAlexander Aring1-7/+8
This patch fixes a possible use after free if tracing for the specific event is enabled. To avoid the use after free we introduce a out_put label like all other user lock specific requests and safe in a boolean to do a put or not which depends on the execution path of dlm_user_request(). Cc: stable@vger.kernel.org Fixes: 7a3de7324c2b ("fs: dlm: trace user space callbacks") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-29genetlink: start to validate reserved header bytesJakub Kicinski1-0/+1
We had historically not checked that genlmsghdr.reserved is 0 on input which prevents us from using those precious bytes in the future. One use case would be to extend the cmd field, which is currently just 8 bits wide and 256 is not a lot of commands for some core families. To make sure that new families do the right thing by default put the onus of opting out of validation on existing families. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> (NetLabel) Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-23fs: dlm: const void resource name parameterAlexander Aring2-11/+14
The resource name parameter should never be changed by DLM so we declare it as const. At some point it is handled as a char pointer, a resource name can be a non printable ascii string as well. This patch change it to handle it as void pointer as it is offered by DLM API. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23fs: dlm: LSFL_CB_DELAY only for kernel lockspacesAlexander Aring1-6/+7
This patch only set/clear the LSFL_CB_DELAY bit when it's actually a kernel lockspace signaled by if ls->ls_callback_wq is set or not set in this case. User lockspaces will never evaluate this flag. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23fs: dlm: remove DLM_LSFL_FS from uapiAlexander Aring3-7/+40
The DLM_LSFL_FS flag is set in lockspaces created directly for a kernel user, as opposed to those lockspaces created for user space applications. The user space libdlm allowed this flag to be set for lockspaces created from user space, but then used by a kernel user. No kernel user has ever used this method, so remove the ability to do it. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23fs: dlm: trace user space callbacksAlexander Aring2-5/+26
This patch adds trace callbacks for user locks. Unfortenately user locks are handled in a different way than kernel locks in some cases. User locks never call the dlm_lock()/dlm_unlock() kernel API and use the next step internal API of dlm. Adding those traces from user API callers should make it possible for dlm trace system to see lock handling for user locks as well. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23fs: dlm: change ls_clear_proc_locks to spinlockAlexander Aring4-8/+8
This patch changes the ls_clear_proc_locks to a spinlock because there is no need to handle it as a mutex as there is no sleepable context when ls_clear_proc_locks is held. This allows us to call those functionality in non-sleepable contexts. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23fs: dlm: remove dlm_del_ast prototypeAlexander Aring1-1/+0
This patch removes dlm_del_ast() prototype which is not being used in the dlm subsystem because there is not implementation for it. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23fs: dlm: handle rcom in else if branchAlexander Aring1-1/+4
Currently we handle in dlm_receive_buffer() everything else than a DLM_MSG type as DLM_RCOM message. Although a different message than DLM_MSG should be a DLM_RCOM we should explicit check on DLM_RCOM and drop a log_error() if we see something unexpected. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23fs: dlm: allow lockspaces have zero lvblenAlexander Aring1-1/+1
A dlm user may not use the DLM_LKF_VALBLK flag in the DLM API, so a zero lvblen should be allowed as a lockspace parameter. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23fs: dlm: fix invalid derefence of sb_lvbptrAlexander Aring1-1/+1
I experience issues when putting a lkbsb on the stack and have sb_lvbptr field to a dangled pointer while not using DLM_LKF_VALBLK. It will crash with the following kernel message, the dangled pointer is here 0xdeadbeef as example: [ 102.749317] BUG: unable to handle page fault for address: 00000000deadbeef [ 102.749320] #PF: supervisor read access in kernel mode [ 102.749323] #PF: error_code(0x0000) - not-present page [ 102.749325] PGD 0 P4D 0 [ 102.749332] Oops: 0000 [#1] PREEMPT SMP PTI [ 102.749336] CPU: 0 PID: 1567 Comm: lock_torture_wr Tainted: G W 5.19.0-rc3+ #1565 [ 102.749343] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-2.module+el8.7.0+15506+033991b0 04/01/2014 [ 102.749344] RIP: 0010:memcpy_erms+0x6/0x10 [ 102.749353] Code: cc cc cc cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe [ 102.749355] RSP: 0018:ffff97a58145fd08 EFLAGS: 00010202 [ 102.749358] RAX: ffff901778b77070 RBX: 0000000000000000 RCX: 0000000000000040 [ 102.749360] RDX: 0000000000000040 RSI: 00000000deadbeef RDI: ffff901778b77070 [ 102.749362] RBP: ffff97a58145fd10 R08: ffff901760b67a70 R09: 0000000000000001 [ 102.749364] R10: ffff9017008e2cb8 R11: 0000000000000001 R12: ffff901760b67a70 [ 102.749366] R13: ffff901760b78f00 R14: 0000000000000003 R15: 0000000000000001 [ 102.749368] FS: 0000000000000000(0000) GS:ffff901876e00000(0000) knlGS:0000000000000000 [ 102.749372] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 102.749374] CR2: 00000000deadbeef CR3: 000000017c49a004 CR4: 0000000000770ef0 [ 102.749376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 102.749378] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 102.749379] PKRU: 55555554 [ 102.749381] Call Trace: [ 102.749382] <TASK> [ 102.749383] ? send_args+0xb2/0xd0 [ 102.749389] send_common+0xb7/0xd0 [ 102.749395] _unlock_lock+0x2c/0x90 [ 102.749400] unlock_lock.isra.56+0x62/0xa0 [ 102.749405] dlm_unlock+0x21e/0x330 [ 102.749411] ? lock_torture_stats+0x80/0x80 [dlm_locktorture] [ 102.749416] torture_unlock+0x5a/0x90 [dlm_locktorture] [ 102.749419] ? preempt_count_sub+0xba/0x100 [ 102.749427] lock_torture_writer+0xbd/0x150 [dlm_locktorture] [ 102.786186] kthread+0x10a/0x130 [ 102.786581] ? kthread_complete_and_exit+0x20/0x20 [ 102.787156] ret_from_fork+0x22/0x30 [ 102.787588] </TASK> [ 102.787855] Modules linked in: dlm_locktorture torture rpcsec_gss_krb5 intel_rapl_msr intel_rapl_common kvm_intel iTCO_wdt iTCO_vendor_support kvm vmw_vsock_virtio_transport qxl irqbypass vmw_vsock_virtio_transport_common drm_ttm_helper crc32_pclmul joydev crc32c_intel ttm vsock virtio_scsi virtio_balloon snd_pcm drm_kms_helper virtio_console snd_timer snd drm soundcore syscopyarea i2c_i801 sysfillrect sysimgblt i2c_smbus pcspkr fb_sys_fops lpc_ich serio_raw [ 102.792536] CR2: 00000000deadbeef [ 102.792930] ---[ end trace 0000000000000000 ]--- This patch fixes the issue by checking also on DLM_LKF_VALBLK on exflags is set when copying the lvbptr array instead of if it's just null which fixes for me the issue. I think this patch can fix other dlm users as well, depending how they handle the init, freeing memory handling of sb_lvbptr and don't set DLM_LKF_VALBLK for some dlm_lock() calls. It might a there could be a hidden issue all the time. However with checking on DLM_LKF_VALBLK the user always need to provide a sb_lvbptr non-null value. There might be more intelligent handling between per ls lvblen, DLM_LKF_VALBLK and non-null to report the user the way how DLM API is used is wrong but can be added for later, this will only fix the current behaviour. Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23fs: dlm: handle -EINVAL as log_error()Alexander Aring1-2/+30
If the user generates -EINVAL it's probably because they are using DLM incorrectly. Change the log level to make these errors more visible. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23fs: dlm: use __func__ for function nameAlexander Aring1-2/+2
Avoid hard-coded function names inside message format strings. (Prevents checkpatch warnings.) Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23fs: dlm: handle -EBUSY first in unlock validationAlexander Aring1-22/+22
This patch checks for -EBUSY conditions in dlm_unlock() before checking for -EINVAL conditions (except for CANCEL and FORCEUNLOCK calls where a busy condition is expected.) There are no problems with the current ordering of checks, but this makes dlm_unlock() consistent with dlm_lock(), and may avoid future problems if other checks are added. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23fs: dlm: handle -EBUSY first in lock arg validationAlexander Aring1-9/+9
During lock arg validation, first check for -EBUSY cases, then for -EINVAL cases. The -EINVAL checks look at lkb state variables which are not stable when an lkb is busy and would cause an -EBUSY result, e.g. lkb->lkb_grmode. Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23fs: dlm: fix race between test_bit() and queue_work()Alexander Aring1-2/+4
This patch fixes a race by using ls_cb_mutex around the bit operations and conditional code blocks for LSFL_CB_DELAY. The function dlm_callback_stop() expects to stop all callbacks and flush all currently queued onces. The set_bit() is not enough because there can still be queue_work() after the workqueue was flushed. To avoid queue_work() after set_bit(), surround both by ls_cb_mutex. Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-23fs: dlm: fix race in lowcommsAlexander Aring1-0/+4
This patch fixes a race between queue_work() in _dlm_lowcomms_commit_msg() and srcu_read_unlock(). The queue_work() can take the final reference of a dlm_msg and so msg->idx can contain garbage which is signaled by the following warning: [ 676.237050] ------------[ cut here ]------------ [ 676.237052] WARNING: CPU: 0 PID: 1060 at include/linux/srcu.h:189 dlm_lowcomms_commit_msg+0x41/0x50 [ 676.238945] Modules linked in: dlm_locktorture torture rpcsec_gss_krb5 intel_rapl_msr intel_rapl_common iTCO_wdt iTCO_vendor_support qxl kvm_intel drm_ttm_helper vmw_vsock_virtio_transport kvm vmw_vsock_virtio_transport_common ttm irqbypass crc32_pclmul joydev crc32c_intel serio_raw drm_kms_helper vsock virtio_scsi virtio_console virtio_balloon snd_pcm drm syscopyarea sysfillrect sysimgblt snd_timer fb_sys_fops i2c_i801 lpc_ich snd i2c_smbus soundcore pcspkr [ 676.244227] CPU: 0 PID: 1060 Comm: lock_torture_wr Not tainted 5.19.0-rc3+ #1546 [ 676.245216] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-2.module+el8.7.0+15506+033991b0 04/01/2014 [ 676.246460] RIP: 0010:dlm_lowcomms_commit_msg+0x41/0x50 [ 676.247132] Code: fe ff ff ff 75 24 48 c7 c6 bd 0f 49 bb 48 c7 c7 38 7c 01 bd e8 00 e7 ca ff 89 de 48 c7 c7 60 78 01 bd e8 42 3d cd ff 5b 5d c3 <0f> 0b eb d8 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 [ 676.249253] RSP: 0018:ffffa401c18ffc68 EFLAGS: 00010282 [ 676.249855] RAX: 0000000000000001 RBX: 00000000ffff8b76 RCX: 0000000000000006 [ 676.250713] RDX: 0000000000000000 RSI: ffffffffbccf3a10 RDI: ffffffffbcc7b62e [ 676.251610] RBP: ffffa401c18ffc70 R08: 0000000000000001 R09: 0000000000000001 [ 676.252481] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000005 [ 676.253421] R13: ffff8b76786ec370 R14: ffff8b76786ec370 R15: ffff8b76786ec480 [ 676.254257] FS: 0000000000000000(0000) GS:ffff8b7777800000(0000) knlGS:0000000000000000 [ 676.255239] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 676.255897] CR2: 00005590205d88b8 CR3: 000000017656c003 CR4: 0000000000770ee0 [ 676.256734] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 676.257567] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 676.258397] PKRU: 55555554 [ 676.258729] Call Trace: [ 676.259063] <TASK> [ 676.259354] dlm_midcomms_commit_mhandle+0xcc/0x110 [ 676.259964] queue_bast+0x8b/0xb0 [ 676.260423] grant_pending_locks+0x166/0x1b0 [ 676.261007] _unlock_lock+0x75/0x90 [ 676.261469] unlock_lock.isra.57+0x62/0xa0 [ 676.262009] dlm_unlock+0x21e/0x330 [ 676.262457] ? lock_torture_stats+0x80/0x80 [dlm_locktorture] [ 676.263183] torture_unlock+0x5a/0x90 [dlm_locktorture] [ 676.263815] ? preempt_count_sub+0xba/0x100 [ 676.264361] ? complete+0x1d/0x60 [ 676.264777] lock_torture_writer+0xb8/0x150 [dlm_locktorture] [ 676.265555] kthread+0x10a/0x130 [ 676.266007] ? kthread_complete_and_exit+0x20/0x20 [ 676.266616] ret_from_fork+0x22/0x30 [ 676.267097] </TASK> [ 676.267381] irq event stamp: 9579855 [ 676.267824] hardirqs last enabled at (9579863): [<ffffffffbb14e6f8>] __up_console_sem+0x58/0x60 [ 676.268896] hardirqs last disabled at (9579872): [<ffffffffbb14e6dd>] __up_console_sem+0x3d/0x60 [ 676.270008] softirqs last enabled at (9579798): [<ffffffffbc200349>] __do_softirq+0x349/0x4c7 [ 676.271438] softirqs last disabled at (9579897): [<ffffffffbb0d54c0>] irq_exit_rcu+0xb0/0xf0 [ 676.272796] ---[ end trace 0000000000000000 ]--- I reproduced this warning with dlm_locktorture test which is currently not upstream. However this patch fix the issue by make a additional refcount between dlm_lowcomms_new_msg() and dlm_lowcomms_commit_msg(). In case of the race the kref_put() in dlm_lowcomms_commit_msg() will be the final put. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-01fs: dlm: move kref_put assert for lkb structsAlexander Aring1-3/+8
The unhold_lkb() function decrements the lock's kref, and asserts that the ref count was not the final one. Use the kref_put release function (which should not be called) to call the assert, rather than doing the assert based on the kref_put return value. Using kill_lkb() as the release function doesn't make sense if we only want to assert. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-01fs: dlm: don't use deprecated timeout features by defaultAlexander Aring8-5/+118
This patch will disable use of deprecated timeout features if CONFIG_DLM_DEPRECATED_API is not set. The deprecated features will be removed in upcoming kernel release v6.2. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-01fs: dlm: add deprecation Kconfig and warnings for timeoutsAlexander Aring3-1/+27
This patch adds a CONFIG_DLM_DEPRECATED_API Kconfig option that must be enabled to use two timeout-related features that we intend to remove in kernel v6.2. Warnings are printed if either is enabled and used. Neither has ever been used as far as we know. . The DLM_LSFL_TIMEWARN lockspace creation flag will be removed, along with the associated configfs entry for setting the timeout. Setting the flag and configfs file would cause dlm to track how long locks were waiting for reply messages. After a timeout, a kernel message would be logged, and a netlink message would be sent to userspace. Recently, midcomms messages have been added that produce much better logging about actual problems with messages. No use has ever been found for the netlink messages. . The userspace libdlm API has allowed the DLM_LKF_TIMEOUT flag with a timeout value to be set in lock requests. The lock request would be cancelled after the timeout. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-06-24fs: dlm: remove timeout from dlm_user_adopt_orphanAlexander Aring3-3/+2
Remove the unused timeout parameter from dlm_user_adopt_orphan(). Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-06-24fs: dlm: remove waiter warningsAlexander Aring6-91/+0
This patch removes warning messages that could be logged when remote requests had been waiting on a reply message for some timeout period (which could be set through configfs, but was rarely enabled.) The improved midcomms layer now carefully tracks all messages and replies, and logs much more useful messages if there is an actual problem. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-06-24fs: dlm: fix grammar in lowcomms outputAlexander Aring1-2/+2
This patch fixes some grammar output in lowcomms implementation by removing the "successful" word which should be "successfully" but it can never be unsuccessfully so we remove it. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-06-24fs: dlm: add comment about lkb IFL flagsAlexander Aring1-0/+8
This patch adds comments about the difference between the lower 2 bytes of lkb flags and the 2 upper bytes of the lkb IFL flags. In short the upper 2 bytes will be handled as internal flags whereas the lower 2 bytes are part of the DLM protocol and are used to exchange messages. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-06-24fs: dlm: handle recovery result outside of ls_recoverAlexander Aring1-16/+26
This patch cleans up the handling of recovery results by moving it from ls_recover() to the caller do_ls_recovery(). This makes the error handling clearer. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-06-24fs: dlm: make new_lockspace() wait until recovery completesAlexander Aring4-19/+20
Make dlm_new_lockspace() wait until a full recovery completes sucessfully or fails. Previously, dlm_new_lockspace() returned to the caller after dlm_recover_members() finished, which is only partially through recovery. The result of the previous behavior is that the new lockspace would not be usable for some time (especially with overlapping recoveries), and some errors in the later part of recovery could not be returned to the caller. Kernel callers gfs2 and cluster-md have their own wait handling to wait for recovery to complete after calling dlm_new_lockspace(). This continues to work, but will be unnecessary. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-06-24fs: dlm: call dlm_lsop_recover_prep onceAlexander Aring1-1/+10
A lockspace can be "stopped" multiple times consecutively before being "started" (when recoveries overlap.) In this case, the lsop_recover_prep callback only needs to be called once when the lockspace is first stopped, and not repeatedly for each stop. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-06-24fs: dlm: update comments about recovery and membership handlingAlexander Aring2-1/+9
Make clear that a particular recovery iteration must not be aborted before membership changes are applied to the members list (ls_nodes) and midcomms layer. Interrupting recovery before this can result in missing node-specific changes in midcomms or through lsops. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-06-24fs: dlm: add resource name to tracepointsAlexander Aring1-2/+2
This patch adds the resource name to dlm tracepoints. The name usually comes through the lkb_resource, but in some cases a resource may not yet be associated with an lkb, in which case the name and namelen parameters are used. It should be okay to access the lkb_resource and the res_name field at the time when the tracepoint is invoked. The resource is assigned to a lkb and it's reference is being held during the tracepoint call. During this time the resource cannot be freed. Also a lkb will never switch its assigned resource. The name of a dlm_rsb is assigned at creation time and should never be changed during runtime as well. The TP_printk() call uses always a hexadecimal string array representation for the resource name (which is not necessarily ascii.) Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-06-24fs: dlm: remove additional dereference of lksbAlexander Aring1-1/+1
This patch removes a dereference of lksb of lkb when calling ast tracepoint. First it reduces additional overhead, even if traces are not active. Second we can deference it in TP_fast_assign from the existing lkb parameter. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2022-06-24fs: dlm: change ast and bast trace orderAlexander Aring1-2/+2
This patch moves the trace calls for ast and bast to before the ast and bast callback functions are called rather than after. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>