aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2016-01-29Revert "btrfs: synchronize incompat feature bits with sysfs files"HEADmasterChris Mason4-17/+0
This reverts commit 14e46e04958df740c6c6a94849f176159a333f13. This ends up doing sysfs operations from deep in balance (where we should be GFP_NOFS) and under heavy balance load, we're making races against sysfs internals. Revert it for now while we figure things out. Signed-off-by: Chris Mason <clm@fb.com>
2016-01-27btrfs: don't use GFP_HIGHMEM for free-space-tree bitmap kzallocChris Mason1-1/+1
This was copied incorrectly from the __vmalloc call. Signed-off-by: Chris Mason <clm@fb.com>
2016-01-27Merge branch 'dev/fst-followup' of ↵Chris Mason6-18/+34
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
2016-01-27btrfs: sysfs: check initialization state before updating featuresDavid Sterba1-0/+3
If the mount phase is not finished, we can't update the sysfs files. Reported-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-25Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"David Sterba1-1/+0
This reverts commit 696249132158014d594896df3a81390616069c5c. The cleaner thread can block freezing when there's a snapshot cleaning in progress and the other threads get suspended first. From the logs provided by Martin we're waiting for reading extent pages: kernel: PM: Syncing filesystems ... done. kernel: Freezing user space processes ... (elapsed 0.015 seconds) done. kernel: Freezing remaining freezable tasks ... kernel: Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0): kernel: btrfs-cleaner D ffff88033dd13bc0 0 152 2 0x00000000 kernel: ffff88032ebc2e00 ffff88032e750000 ffff88032e74fa50 7fffffffffffffff kernel: ffffffff814a58df 0000000000000002 ffffea000934d580 ffffffff814a5451 kernel: 7fffffffffffffff ffffffff814a6e8f 0000000000000000 0000000000000020 kernel: Call Trace: kernel: [<ffffffff814a58df>] ? bit_wait+0x2c/0x2c kernel: [<ffffffff814a5451>] ? schedule+0x6f/0x7c kernel: [<ffffffff814a6e8f>] ? schedule_timeout+0x2f/0xd8 kernel: [<ffffffff81076f94>] ? timekeeping_get_ns+0xa/0x2e kernel: [<ffffffff81077603>] ? ktime_get+0x36/0x44 kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2 kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2 kernel: [<ffffffff814a590b>] ? bit_wait_io+0x2c/0x30 kernel: [<ffffffff814a5694>] ? __wait_on_bit+0x41/0x73 kernel: [<ffffffff8109eba8>] ? wait_on_page_bit+0x6d/0x72 kernel: [<ffffffff8105d718>] ? autoremove_wake_function+0x2a/0x2a kernel: [<ffffffff811a02d7>] ? read_extent_buffer_pages+0x1bd/0x203 kernel: [<ffffffff8117d9e9>] ? free_root_pointers+0x4c/0x4c kernel: [<ffffffff8117e831>] ? btree_read_extent_buffer_pages.constprop.57+0x5a/0xe9 kernel: [<ffffffff8117f4f3>] ? read_tree_block+0x2d/0x45 kernel: [<ffffffff8116782a>] ? read_block_for_search.isra.34+0x22a/0x26b kernel: [<ffffffff811656c3>] ? btrfs_set_path_blocking+0x1e/0x4a kernel: [<ffffffff8116919b>] ? btrfs_search_slot+0x648/0x736 kernel: [<ffffffff81170559>] ? btrfs_lookup_extent_info+0xb7/0x2c7 kernel: [<ffffffff81170ee5>] ? walk_down_proc+0x9c/0x1ae kernel: [<ffffffff81171c9d>] ? walk_down_tree+0x40/0xa4 kernel: [<ffffffff8117375f>] ? btrfs_drop_snapshot+0x2da/0x664 kernel: [<ffffffff8104ff21>] ? finish_task_switch+0x126/0x167 kernel: [<ffffffff811850f8>] ? btrfs_clean_one_deleted_snapshot+0xa6/0xb0 kernel: [<ffffffff8117eaba>] ? cleaner_kthread+0x13e/0x17b kernel: [<ffffffff8117e97c>] ? btrfs_item_end+0x33/0x33 kernel: [<ffffffff8104d256>] ? kthread+0x95/0x9d kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16 kernel: [<ffffffff814a7b5f>] ? ret_from_fork+0x3f/0x70 kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16 As this affects a released kernel (4.4) we need a minimal fix for stable kernels. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=108361 Reported-by: Martin Ziegler <ziegler@uni-freiburg.de> CC: stable@vger.kernel.org # 4.4 CC: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-25btrfs: async-thread: Fix a use-after-free error for traceQu Wenruo1-1/+1
Parameter of trace_btrfs_work_queued() can be freed in its workqueue. So no one use use that pointer after queue_work(). Fix the user-after-free bug by move the trace line before queue_work(). Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-25Btrfs: fix race between fsync and lockless direct IO writesFilipe Manana2-11/+39
An fsync, using the fast path, can race with a concurrent lockless direct IO write and end up logging a file extent item that points to an extent that wasn't written to yet. This is because the fast fsync path collects ordered extents into a local list and then collects all the new extent maps to log file extent items based on them, while the direct IO write path creates the new extent map before it creates the corresponding ordered extent (and submitting the respective bio(s)). So fix this by making the direct IO write path create ordered extents before the extent maps and make the fast fsync path collect any new ordered extents after it collects the extent maps. Note that making the fsync handler call inode_dio_wait() (after acquiring the inode's i_mutex) would not work and lead to a deadlock when doing AIO, as through AIO we end up in a path where the fsync handler is called (through dio_aio_complete_work() -> dio_complete() -> vfs_fsync_range()) before the inode's dio counter is decremented (inode_dio_wait() waits for this counter to have a value of zero). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-25Merge branch 'fix/fst-sysfs' of ↵Chris Mason6-1/+53
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>
2016-01-25btrfs: add free space tree to the cow-only listDavid Sterba1-1/+2
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-25btrfs: add free space tree to lockdep classesDavid Sterba1-0/+1
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-22btrfs: tweak free space tree bitmap allocationDavid Sterba1-2/+16
The requested bitmap size varies, observed numbers were < 4K up to 16K. Using vmalloc unconditionally would be too heavy, we'll try contiguous allocations first and fall back to vmalloc if there's no contig memory. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-22btrfs: tests: switch to GFP_KERNELDavid Sterba3-15/+15
There's no reason to do GFP_NOFS in tests, it's not data-heavy and memory allocation failures would affect only developers or testers. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-21btrfs: synchronize incompat feature bits with sysfs filesDavid Sterba4-0/+17
The files under /sys/fs/UUID/features get out of sync with the actual incompat bits set for the filesystem if they change after mount (eg. the LZO compression). Synchronize the feature bits with the sysfs files representing them right after we set/clear them. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-21btrfs: sysfs: introduce helper for syncing bits with sysfs filesDavid Sterba2-0/+33
The files under /sys/fs/UUID/features get out of sync with the actual incompat bits set for the filesystem if they change after mount. We're going to sync them and need a helper to do that. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-21btrfs: sysfs: add free-space-tree bit attributeDavid Sterba1-0/+2
The incompat bit representing the newly added free space tree feature is missing. Right now it will be listed only among features supported by the module, not per-fs. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-20btrfs: sysfs: fix typo in compat_ro attribute definitionDavid Sterba1-1/+1
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-20btrfs: raid56: Use raid_write_end_io for scrubZhao Lei1-27/+5
No need to create additional end_io function for scrub, it increased code size and introduced some un-unified lines, as: raid_write_parity_end_io(): int err = bio->bi_error; if (bio->bi_error) raid_write_end_io(): int err = bio->bi_error; if (err) This patch combines them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: Remove unnecessary ClearPageUptodate for raid56Zhao Lei1-2/+0
PageUptodate flag already initialized to 0 for new page, no need to set it again. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: use rbio->nr_pages to reduce calculationZhao Lei1-12/+7
We can use rbio->stripe_npages to reduce unnecessary calculation in many code place. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: Use unified stripe_page's index calculationZhao Lei1-22/+21
We are using different index calculation method for stripe_page in current code: 1: (rbio->stripe_len / PAGE_CACHE_SIZE) * stripe_index + page_index 2: DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE) * stripe_index + page_index 3: DIV_ROUND_UP(rbio->stripe_len * stripe_index, PAGE_CACHE_SIZE) + page_index ... They can get same result when stripe_len align to PAGE_CACHE_SIZE, this is why current code can work, intruduce and use a common function for calculation is a better choose. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: Fix calculation of rbio->dbitmap's size calculationZhao Lei2-3/+3
Current code is trying to calculate rbio->dbitmap's size to make it align to sizeof(long), but implement haven't achived this object, it is align to sizeof(char) instead. This patch fixed above calculation, and use sizeof(long) instead of fixed "8" to increate compatibility. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: Fix no_space in write and rm loopZhao Lei1-1/+3
I see no_space in v4.4-rc1 again in xfstests generic/102. It happened randomly in some node only. (one of 4 phy-node, and a kvm with non-virtio block driver) By bisect, we can found the first-bad is: commit bdced438acd8 ("block: setup bi_phys_segments after splitting")' But above patch only triggered the bug by making bio operation faster(or slower). Main reason is in our space_allocating code, we need to commit page writeback before wait it complish, this patch fixed above bug. BTW, there is another reason for generic/102 fail, caused by disable default mixed-blockgroup, I'll fix it in xfstests. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: merge functions for wait snapshot creationZhao Lei3-21/+22
wait_for_snapshot_creation() is in same group with oher two: btrfs_start_write_no_snapshoting() btrfs_end_write_no_snapshoting() Rename wait_for_snapshot_creation() and move it into same place with other two. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-20btrfs: delete unused argument in btrfs_copy_from_userZhao Lei1-4/+2
size_t write_bytes is not necessary for btrfs_copy_from_user(), delete it. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19btrfs: Use direct way to determine raid56 write/recover modeZhao Lei1-1/+2
Old code used bbio->raid_map to determine whether in raid56 write/recover operation, because we didn't't have bbio->map_type. Now we have direct way for this condition, rid of using the function-relative data, and make the code more readable. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19btrfs: Small cleanup for get index_srcdev loopZhao Lei1-22/+20
1: Adjust condition in loop to make less TAB 2: Move btrfs_put_bbio()'s line for combine, and makes logic clean. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19btrfs: Enhance chunk validation checkQu Wenruo1-1/+32
Enhance chunk validation: 1) Num_stripes We already have such check but it's only in super block sys chunk array. Now check all on-disk chunks. 2) Chunk logical It should be aligned to sector size. This behavior should be *DOUBLE CHECKED* for 64K sector size like PPC64 or AArch64. Maybe we can found some hidden bugs. 3) Chunk length Same as chunk logical, should be aligned to sector size. 4) Stripe length It should be power of 2. 5) Chunk type Any bit out of TYPE_MAS | PROFILE_MASK is invalid. With all these much restrict rules, several fuzzed image reported in mail list should no longer cause kernel panic. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19btrfs: Enhance super validation checkQu Wenruo1-49/+48
Enhance btrfs_check_super_valid() function by the following points: 1) Restrict sector/node size check Not the old max/min valid check, but also check if it's a power of 2. So some bogus number like 12K node size won't pass now. 2) Super flag check For now, there is still some inconsistency between kernel and btrfs-progs super flags. And considering btrfs-progs may add new flags for super block, this check will only output warning. 3) Better root alignment check Now root bytenr is checked against sector size. 4) Move some check into btrfs_check_super_valid(). Like node size vs leaf size check, and PAGESIZE vs sectorsize check. And magic number check. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19Btrfs: fix deadlock running delayed iputs at transaction commit timeFilipe Manana4-8/+10
While running a stress test I ran into a deadlock when running the delayed iputs at transaction time, which produced the following report and trace: [ 886.399989] ============================================= [ 886.400871] [ INFO: possible recursive locking detected ] [ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted [ 886.402384] --------------------------------------------- [ 886.403182] fio/8277 is trying to acquire lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] but task is already holding lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] other info that might help us debug this: [ 886.403568] Possible unsafe locking scenario: [ 886.403568] [ 886.403568] CPU0 [ 886.403568] ---- [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] [ 886.403568] *** DEADLOCK *** [ 886.403568] [ 886.403568] May be due to missing lock nesting notation [ 886.403568] [ 886.403568] 3 locks held by fio/8277: [ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0 [ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs] [ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] stack backtrace: [ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 886.403568] 0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0 [ 886.403568] ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290 [ 886.403568] 0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804 [ 886.403568] Call Trace: [ 886.403568] [<ffffffff8125d4fd>] dump_stack+0x4e/0x79 [ 886.403568] [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b [ 886.403568] [<ffffffff810c22db>] ? __module_address+0xdf/0x108 [ 886.403568] [<ffffffff8108eb77>] lock_acquire+0x10d/0x194 [ 886.403568] [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194 [ 886.403568] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffff8148556b>] down_read+0x3e/0x4d [ 886.489542] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs] [ 886.489542] [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs] [ 886.489542] [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs] [ 886.489542] [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs] [ 886.489542] [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs] [ 886.489542] [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs] [ 886.489542] [<ffffffff81188e31>] evict+0xa7/0x15c [ 886.489542] [<ffffffff81189878>] iput+0x1d3/0x266 [ 886.489542] [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs] [ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [<ffffffff81085096>] ? signal_pending_state+0x31/0x31 [ 886.489542] [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs] [ 886.489542] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 886.489542] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 886.489542] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 886.489542] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128 [ 886.489542] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 886.489542] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50 [ 886.489542] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5 [ 886.489542] [<ffffffff81172cda>] vfs_write+0xa0/0xe4 [ 886.489542] [<ffffffff811734cc>] SyS_write+0x50/0x7e [ 886.489542] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds. [ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.863227] fio D ffff880213f9bb28 0 8244 8240 0x00000000 [ 1081.868719] ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240 [ 1081.872499] ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400 [ 1081.876834] ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4 [ 1081.880782] Call Trace: [ 1081.881793] [<ffffffff81482ba4>] schedule+0x7f/0x97 [ 1081.883340] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325 [ 1081.895525] [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab [ 1081.897419] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20 [ 1081.899251] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20 [ 1081.901063] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21 [ 1081.902365] [<ffffffff814855bd>] down_write+0x43/0x57 [ 1081.903846] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.906078] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.908846] [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c [ 1081.910409] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 1081.912482] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 1081.914597] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 1081.919037] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128 [ 1081.920754] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 1081.922496] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50 [ 1081.923922] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5 [ 1081.925275] [<ffffffff81172cda>] vfs_write+0xa0/0xe4 [ 1081.926584] [<ffffffff811734cc>] SyS_write+0x50/0x7e [ 1081.927968] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.985293] INFO: lockdep is turned off. [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds. [ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.990147] fio D ffff880218febbb8 0 8249 8240 0x00000000 [ 1081.991626] ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240 [ 1081.993258] ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00 [ 1081.994850] ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4 [ 1081.996485] Call Trace: [ 1081.997037] [<ffffffff81482ba4>] schedule+0x7f/0x97 [ 1081.998017] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325 [ 1081.999241] [<ffffffff810852a5>] ? finish_wait+0x6d/0x76 [ 1082.000306] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20 [ 1082.001533] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20 [ 1082.002776] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21 [ 1082.003995] [<ffffffff814855bd>] down_write+0x43/0x57 [ 1082.005000] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.007403] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.008988] [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs] [ 1082.010193] [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77 [ 1082.011280] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0 [ 1082.012265] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0 [ 1082.013021] [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff [ 1082.013738] [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b [ 1082.014778] [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea [ 1082.015778] [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e [ 1082.016806] [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71 [ 1082.017789] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [ 1082.018706] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f This happens because we can recursively acquire the semaphore fs_info->delayed_iput_sem when attempting to allocate space to satisfy a file write request as shown in the first trace above - when committing a transaction we acquire (down_read) the semaphore before running the delayed iputs, and when running a delayed iput() we can end up calling an inode's eviction handler, which in turn commits another transaction and attempts to acquire (down_read) again the semaphore to run more delayed iput operations. This results in a deadlock because if a task acquires multiple times a semaphore it should invoke down_read_nested() with a different lockdep class for each level of recursion. Fix this by simplifying the implementation and use a mutex instead that is acquired by the cleaner kthread before it runs the delayed iputs instead of always acquiring a semaphore before delayed references are run from anywhere. Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput) Cc: stable@vger.kernel.org # 4.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19Btrfs: fix typo in log message when starting a balanceFilipe Manana1-1/+1
The recent change titled "Btrfs: Check metadata redundancy on balance" (already in linux-next) left a typo in a message for users: metatdata -> metadata. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19Merge branch 'misc-for-4.5' of ↵Chris Mason6-15/+56
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
2016-01-19Merge branch 'misc-cleanups-4.5' of ↵Chris Mason7-24/+32
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
2016-01-19btrfs: remove duplicate const specifierColin Ian King1-1/+1
duplicate const is redundant so remove it Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15btrfs: initialize the seq counter in struct btrfs_deviceSebastian Andrzej Siewior1-0/+1
I managed to trigger this: | INFO: trying to register non-static key. | the code is fine but needs lockdep annotation. | turning off the locking correctness validator. | CPU: 1 PID: 781 Comm: systemd-gpt-aut Not tainted 4.4.0-rt2+ #14 | Hardware name: ARM-Versatile Express | [<80307cec>] (dump_stack) | [<80070e98>] (__lock_acquire) | [<8007184c>] (lock_acquire) | [<80287800>] (btrfs_ioctl) | [<8012a8d4>] (do_vfs_ioctl) | [<8012ac14>] (SyS_ioctl) so I think that btrfs_device_data_ordered_init() is not invoked behind a macro somewhere. Fixes: 7cc8e58d53cd ("Btrfs: fix unprotected device's variants on 32bits machine") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15Btrfs: clean up an error code in btrfs_init_space_info()Dan Carpenter1-1/+1
If we return 1 here, then the caller treats it as an error and returns -EINVAL. It causes a static checker warning to treat positive returns as an error. Fixes: 1aba86d67f34 ('Btrfs: fix easily get into ENOSPC in mixed case') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15btrfs: fix iterator with update error in backref.cGeliang Tang1-5/+5
Fix the following error: fs/btrfs/backref.c:565:1-20: iterator with update on line 577 Fixes: a7ca422('btrfs: use list_for_each_entry* in backref.c') Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15Btrfs: fix output of compression message in btrfs_parse_options()Tsutomu Itoh1-7/+22
The compression message might not be correctly output. Fix it. [[before fix]] # mount -o compress /dev/sdb3 /test3 [ 996.874264] BTRFS info (device sdb3): disk space caching is enabled [ 996.874268] BTRFS: has skinny extents # mount | grep /test3 /dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/) # mount -o remount,compress-force /dev/sdb3 /test3 [ 1035.075017] BTRFS info (device sdb3): force zlib compression [ 1035.075021] BTRFS info (device sdb3): disk space caching is enabled # mount | grep /test3 /dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/) # mount -o remount,compress /dev/sdb3 /test3 [ 1053.679092] BTRFS info (device sdb3): disk space caching is enabled # mount | grep /test3 /dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/) [[after fix]] # mount -o compress /dev/sdb3 /test3 [ 401.021753] BTRFS info (device sdb3): use zlib compression [ 401.021758] BTRFS info (device sdb3): disk space caching is enabled [ 401.021760] BTRFS: has skinny extents # mount | grep /test3 /dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/) # mount -o remount,compress-force /dev/sdb3 /test3 [ 439.824624] BTRFS info (device sdb3): force zlib compression [ 439.824629] BTRFS info (device sdb3): disk space caching is enabled # mount | grep /test3 /dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/) # mount -o remount,compress /dev/sdb3 /test3 [ 459.918430] BTRFS info (device sdb3): use zlib compression [ 459.918434] BTRFS info (device sdb3): disk space caching is enabled # mount | grep /test3 /dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/) Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and ↵Chandan Rajendra4-8/+33
subvolume roots The following call trace is seen when btrfs/031 test is executed in a loop, [ 158.661848] ------------[ cut here ]------------ [ 158.662634] WARNING: CPU: 2 PID: 890 at /home/chandan/repos/linux/fs/btrfs/ioctl.c:558 create_subvol+0x3d1/0x6ea() [ 158.664102] BTRFS: Transaction aborted (error -2) [ 158.664774] Modules linked in: [ 158.665266] CPU: 2 PID: 890 Comm: btrfs Not tainted 4.4.0-rc6-g511711a #2 [ 158.666251] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 158.667392] ffffffff81c0a6b0 ffff8806c7c4f8e8 ffffffff81431fc8 ffff8806c7c4f930 [ 158.668515] ffff8806c7c4f920 ffffffff81051aa1 ffff880c85aff000 ffff8800bb44d000 [ 158.669647] ffff8808863b5c98 0000000000000000 00000000fffffffe ffff8806c7c4f980 [ 158.670769] Call Trace: [ 158.671153] [<ffffffff81431fc8>] dump_stack+0x44/0x5c [ 158.671884] [<ffffffff81051aa1>] warn_slowpath_common+0x81/0xc0 [ 158.672769] [<ffffffff81051b27>] warn_slowpath_fmt+0x47/0x50 [ 158.673620] [<ffffffff813bc98d>] create_subvol+0x3d1/0x6ea [ 158.674440] [<ffffffff813777c9>] btrfs_mksubvol.isra.30+0x369/0x520 [ 158.675376] [<ffffffff8108a4aa>] ? percpu_down_read+0x1a/0x50 [ 158.676235] [<ffffffff81377a81>] btrfs_ioctl_snap_create_transid+0x101/0x180 [ 158.677268] [<ffffffff81377b52>] btrfs_ioctl_snap_create+0x52/0x70 [ 158.678183] [<ffffffff8137afb4>] btrfs_ioctl+0x474/0x2f90 [ 158.678975] [<ffffffff81144b8e>] ? vma_merge+0xee/0x300 [ 158.679751] [<ffffffff8115be31>] ? alloc_pages_vma+0x91/0x170 [ 158.680599] [<ffffffff81123f62>] ? lru_cache_add_active_or_unevictable+0x22/0x70 [ 158.681686] [<ffffffff813d99cf>] ? selinux_file_ioctl+0xff/0x1d0 [ 158.682581] [<ffffffff8117b791>] do_vfs_ioctl+0x2c1/0x490 [ 158.683399] [<ffffffff813d3cde>] ? security_file_ioctl+0x3e/0x60 [ 158.684297] [<ffffffff8117b9d4>] SyS_ioctl+0x74/0x80 [ 158.685051] [<ffffffff819b2bd7>] entry_SYSCALL_64_fastpath+0x12/0x6a [ 158.685958] ---[ end trace 4b63312de5a2cb76 ]--- [ 158.686647] BTRFS: error (device loop0) in create_subvol:558: errno=-2 No such entry [ 158.709508] BTRFS info (device loop0): forced readonly [ 158.737113] BTRFS info (device loop0): disk space caching is enabled [ 158.738096] BTRFS error (device loop0): Remounting read-write after error is not allowed [ 158.851303] BTRFS error (device loop0): cleaner transaction attach returned -30 This occurs because, Mount filesystem Create subvol with ID 257 Unmount filesystem Mount filesystem Delete subvol with ID 257 btrfs_drop_snapshot() Add root corresponding to subvol 257 into btrfs_transaction->dropped_roots list Create new subvol (i.e. create_subvol()) 257 is returned as the next free objectid btrfs_read_fs_root_no_name() Finds the btrfs_root instance corresponding to the old subvol with ID 257 in btrfs_fs_info->fs_roots_radix. Returns error since btrfs_root_item->refs has the value of 0. To fix the issue the commit initializes tree root's and subvolume root's highest_objectid when loading the roots from disk. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15btrfs: cleanup, stop casting for extent_map->lookup everywhereJeff Mahoney6-17/+25
Overloading extent_map->bdev to struct map_lookup * might have started out as a means to an end, but it's a pattern that's used all over the place now. Let's get rid of the casting and just add a union instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-11Merge branch 'for-chris-4.5' of ↵Chris Mason4-18/+58
git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>
2016-01-11Merge branch 'misc-cleanups-4.5' of ↵Chris Mason24-305/+195
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>
2016-01-11Merge branch 'misc-for-4.5' of ↵Chris Mason22-144/+186
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
2016-01-07Btrfs: fix fitrim discarding device area reserved for boot loader's useFilipe Manana1-10/+10
As of the 4.3 kernel release, the fitrim ioctl can now discard any region of a disk that is not allocated to any chunk/block group, including the first megabyte which is used for our primary superblock and by the boot loader (grub for example). Fix this by not allowing to trim/discard any region in the device starting with an offset not greater than min(alloc_start_mount_option, 1Mb), just as it was not possible before 4.3. A reproducer test case for xfstests follows. seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 # Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are # reserved for a boot loader to use (GRUB for example) and btrfs should never # use them - neither for allocating metadata/data nor should trim/discard them. # The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem. $XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io $XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io # Now mount the filesystem and perform a fitrim against it. _scratch_mount _require_batched_discard $SCRATCH_MNT $FSTRIM_PROG $SCRATCH_MNT # Now unmount the filesystem and verify the content of the ranges was not # modified (no trim/discard happened on them). _scratch_unmount echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:" od -t x1 -N $((64 * 1024)) $SCRATCH_DEV od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV status=0 exit Reported-by: Vincent Petry <PVince81@yahoo.fr> Reported-by: Andrei Borzenkov <arvidjaar@gmail.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341 Fixes: 499f377f49f0 (btrfs: iterate over unused chunk space in FITRIM) Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-01-07Btrfs: Check metadata redundancy on balanceSam Tygier1-0/+7
When converting a filesystem via balance check that metadata mode is at least as redundant as the data mode. For example give warning when: -dconvert=raid1 -mconvert=single Signed-off-by: Sam Tygier <samtygier@yahoo.co.uk> [ minor message reformatting ] Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: statfs: report zero available if metadata are exhaustedDavid Sterba1-0/+24
There is one ENOSPC case that's very confusing. There's Available greater than zero but no file operation succeds (besides removing files). This happens when the metadata are exhausted and there's no possibility to allocate another chunk. In this scenario it's normal that there's still some space in the data chunk and the calculation in df reflects that in the Avail value. To at least give some clue about the ENOSPC situation, let statfs report zero value in Avail, even if there's still data space available. Current: /dev/sdb1 4.0G 3.3G 719M 83% /mnt/test New: /dev/sdb1 4.0G 3.3G 0 100% /mnt/test We calculate the remaining metadata space minus global reserve. If this is (supposedly) smaller than zero, there's no space. But this does not hold in practice, the exhausted state happens where's still some positive delta. So we apply some guesswork and compare the delta to a 4M threshold. (Practically observed delta was 2M.) We probably cannot calculate the exact threshold value because this depends on the internal reservations requested by various operations, so some operations that consume a few metadata will succeed even if the Avail is zero. But this is better than the other way around. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: preallocate path for snapshot creation at ioctl timeDavid Sterba3-6/+8
We can also preallocate btrfs_path that's used during pending snapshot creation and avoid another late ENOMEM failure. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: allocate root item at snapshot ioctl timeDavid Sterba3-6/+13
The actual snapshot creation is delayed until transaction commit. If we cannot get enough memory for the root item there, we have to fail the whole transaction commit which is bad. So we'll allocate the memory at the ioctl call and pass it along with the pending_snapshot struct. The potential ENOMEM will be returned to the caller of snapshot ioctl. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: do an allocation earlier during snapshot creationDavid Sterba1-11/+9
We can allocate pending_snapshot earlier and do not have to do cleanup in case of failure. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: use smaller type for btrfs_path locksDavid Sterba1-1/+1
The values of btrfs_path::locks are 0 to 4, fit into a u8. Let's see: * overall size of btrfs_path drops down from 136 to 112 (-24 bytes), * better packing in a slab page +6 objects * the whole structure now fits to 2 cachelines * slight decrease in code size: text data bss dec hex filename 938731 43670 23144 1005545 f57e9 fs/btrfs/btrfs.ko.before 938203 43670 23144 1005017 f55d9 fs/btrfs/btrfs.ko.after (and the generated assembly does not change much) The main purpose is to decrease the size of the structure without affecting performance. The byte access is usually well behaving accross arches, the locks are not accessed frequently and sometimes just compared to zero. Note for further size reduction attempts: the slots could be made u16 but this might generate worse code on some arches (non-byte and non-int access). Also the range of operations on slots is wider compared to locks and the potential performance drop should be evaluated first. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: use smaller type for btrfs_path lowest_levelDavid Sterba1-1/+1
The level is 0..7, we can use smaller type. The size of btrfs_path is now 136 bytes from 144, which is +2 objects that fit into a 4k slab. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: use smaller type for btrfs_path readaDavid Sterba1-1/+1
The possible values for reada are all positive and bounded, we can later save some bytes by storing it in u8. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: cleanup, use enum values for btrfs_path readaDavid Sterba11-30/+30
Replace the integers by enums for better readability. The value 2 does not have any meaning since a717531942f488209dded30f6bc648167bcefa72 "Btrfs: do less aggressive btree readahead" (2009-01-22). Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: constify static arraysDavid Sterba4-4/+4
There are a few statically initialized arrays that can be made const. The remaining (like file_system_type, sysfs attributes or prop handlers) do not allow that due to type mismatch when passed to the APIs or because the structures are modified through other members. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: constify remaining structs with function pointersDavid Sterba5-8/+8
* struct extent_io_ops * struct btrfs_free_space_op Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs tests: replace whole ops structure for free space testsDavid Sterba1-6/+8
Preparatory work for making btrfs_free_space_op constant. In test_steal_space_from_bitmap_to_extent, we substitute use_bitmap with own version thus preventing constification. We can rework it so we replace the whole structure with the correct function pointers. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: use list_for_each_entry* in backref.cGeliang Tang1-17/+6
Use list_for_each_entry*() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: use list_for_each_entry_safe in free-space-cache.cGeliang Tang1-10/+4
Use list_for_each_entry_safe() instead of list_for_each_safe() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: use list_for_each_entry* in check-integrity.cGeliang Tang1-79/+26
Use list_for_each_entry*() instead of list_for_each*() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07Btrfs: use linux/sizes.h to represent constantsByongho Lee17-177/+147
We use many constants to represent size and offset value. And to make code readable we use '256 * 1024 * 1024' instead of '268435456' to represent '256MB'. However we can make far more readable with 'SZ_256MB' which is defined in the 'linux/sizes.h'. So this patch replaces 'xxx * 1024 * 1024' kind of expression with single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is not a power of 2. And I haven't touched to '4096' & '8192' because it's more intuitive than 'SZ_4KB' & 'SZ_8KB'. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: cleanup, remove stray return statementsDavid Sterba6-9/+0
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: zero out delayed node upon allocationAlexandru Moise1-6/+1
It's slightly cleaner to zero-out the delayed node upon allocation than to do it by hand in btrfs_init_delayed_node() for a few members Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: pass proper enum type to start_transaction()Alexandru Moise1-5/+10
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: switch __btrfs_fs_incompat return type from int to boolAlexandru Moise1-1/+1
Conform to __btrfs_fs_incompat() cast-to-bool (!!) by explicitly returning boolean not int. Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: remove unused inode argument from uncompress_inline()Byongho Lee1-3/+2
The inode argument is never used from the beginning, so remove it. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: don't use slab cache for struct btrfs_delalloc_workDavid Sterba1-12/+2
Although we prefer to use separate caches for various structs, it seems better not to do that for struct btrfs_delalloc_work. Objects of this type are allocated rarely, when transaction commit calls btrfs_start_delalloc_roots, requesting delayed iputs. The objects are temporary (with some IO involved) but still allocated and freed within __start_delalloc_inodes. Memory allocation failure is handled. The slab cache is empty most of the time (observed on several systems), so if we need to allocate a new slab object, the first one has to allocate a full page. In a potential case of low memory conditions this might fail with higher probability compared to using the generic slab caches. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: drop duplicate prefix from scrub workqueuesDavid Sterba1-5/+5
The helper btrfs_alloc_workqueue will add the "btrfs-" prefix. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: verbose error when we find an unexpected item in sys_arrayDavid Sterba1-0/+3
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: handle invalid num_stripes in sys_arrayDavid Sterba1-0/+8
We can handle the special case of num_stripes == 0 directly inside btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to catch other unhandled cases where we fail to validate external data. A crafted or corrupted image crashes at mount time: BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0 BTRFS info (device loop0): disk space caching is enabled BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()! Kernel panic - not syncing: BUG! CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25 Stack: 637af890 60062489 602aeb2e 604192ba 60387961 00000011 637af8a0 6038a835 637af9c0 6038776b 634ef32b 00000000 Call Trace: [<6001c86d>] show_stack+0xfe/0x15b [<6038a835>] dump_stack+0x2a/0x2c [<6038776b>] panic+0x13e/0x2b3 [<6020f099>] btrfs_read_sys_array+0x25d/0x2ff [<601cfbbe>] open_ctree+0x192d/0x27af [<6019c2c1>] btrfs_mount+0x8f5/0xb9a [<600bc9a7>] mount_fs+0x11/0xf3 [<600d5167>] vfs_kern_mount+0x75/0x11a [<6019bcb0>] btrfs_mount+0x2e4/0xb9a [<600bc9a7>] mount_fs+0x11/0xf3 [<600d5167>] vfs_kern_mount+0x75/0x11a [<600d710b>] do_mount+0xa35/0xbc9 [<600d7557>] SyS_mount+0x95/0xc8 [<6001e884>] handle_syscall+0x6b/0x8e Reported-by: Jiri Slaby <jslaby@suse.com> Reported-by: Vegard Nossum <vegard.nossum@oracle.com> CC: stable@vger.kernel.org # 3.19+ Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: better packing of btrfs_delayed_extent_opDavid Sterba4-15/+12
btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now and has 8 unused bytes. Reducing the level type to u8 makes it possible to squeeze it to the padding byte after key. The bitfields were switched to bool as there's space to store the full byte without increasing the whole structure, besides that the generated assembly is smaller. struct btrfs_delayed_extent_op { struct btrfs_disk_key key; /* 0 17 */ u8 level; /* 17 1 */ bool update_key; /* 18 1 */ bool update_flags; /* 19 1 */ bool is_data; /* 20 1 */ /* XXX 3 bytes hole, try to pack */ u64 flags_to_set; /* 24 8 */ /* size: 32, cachelines: 1, members: 6 */ /* sum members: 29, holes: 1, sum holes: 3 */ /* last cacheline: 32 bytes */ }; The final size is 32 bytes which gives +26 object per slab page. text data bss dec hex filename 938811 43670 23144 1005625 f5839 fs/btrfs/btrfs.ko.before 938747 43670 23144 1005561 f57f9 fs/btrfs/btrfs.ko.after Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: put delayed item hook into inodeDavid Sterba2-31/+29
Inodes for delayed iput allocate a trivial helper structure, let's place the list hook directly into the inode and save a kmalloc (killing a __GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode. The inode can be put into the delayed_iputs list more than once and we have to keep the count. This means we can't use the list_splice to process a bunch of inodes because we'd lost track of the count if the inode is put into the delayed iputs again while it's processed. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: Support convert to -d dup for btrfs-convertZhao Lei1-8/+0
Since we will add support for -d dup for non-mixed filesystem, kernel need to support converting to this raid-type. This patch remove limitation of above case. Tested by following script: (combination of dup conversion with fsck): export TEST_DEV='/dev/vdc' export TEST_DIR='/var/ltf/tester/mnt' do_dup_test() { local m_from="$1" local d_from="$2" local m_to="$3" local d_to="$4" echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to" umount "$TEST_DIR" &>/dev/null ./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1 mount "$TEST_DEV" "$TEST_DIR" || return 1 cp -a /sbin/* "$TEST_DIR" [[ "$m_from" != "$m_to" ]] && { ./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1 } [[ "$d_from" != "$d_to" ]] && { local opt=() [[ "$d_to" == single ]] && opt+=("-f") ./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1 } umount "$TEST_DIR" || return 1 ./btrfsck "$TEST_DEV" || return 1 echo return 0 } test_all() { for m_from in single dup; do for d_from in single dup; do for m_to in single dup; do for d_to in single dup; do do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1 done done done done } test_all Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07Btrfs: igrab inode in writepageJosef Bacik1-2/+15
We hit this panic on a few of our boxes this week where we have an ordered_extent with an NULL inode. We do an igrab() of the inode in writepages, but weren't doing it in writepage which can be called directly from the VM on dirty pages. If the inode has been unlinked then we could have I_FREEING set which means igrab() would return NULL and we get this panic. Fix this by trying to igrab in btrfs_writepage, and if it returns NULL then just redirty the page and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07Btrfs: add missing brelse when superblock checksum failsAnand Jain1-0/+1
Looks like oversight, call brelse() when checksum fails. Further down the code, in the non error path, we do call brelse() and so we don't see brelse() in the goto error paths. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-06Btrfs: fix transaction handle leak on failure to create hard linkFilipe Manana1-2/+4
If we failed to create a hard link we were not always releasing the the transaction handle we got before, resulting in a memory leak and preventing any other tasks from being able to commit the current transaction. Fix this by always releasing our transaction handle. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-12-31Btrfs: fix number of transaction units required to create symlinkFilipe Manana1-1/+3
We weren't accounting for the insertion of an inline extent item for the symlink inode nor that we need to update the parent inode item (through the call to btrfs_add_nondir()). So fix this by including two more transaction units. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-31Btrfs: don't leave dangling dentry if symlink creation failedFilipe Manana1-4/+7
When we are creating a symlink we might fail with an error after we created its inode and added the corresponding directory indexes to its parent inode. In this case we end up never removing the directory indexes because the inode eviction handler, called for our symlink inode on the final iput(), only removes items associated with the symlink inode and not with the parent inode. Example: $ mkfs.btrfs -f /dev/sdi $ mount /dev/sdi /mnt $ touch /mnt/foo $ ln -s /mnt/foo /mnt/bar ln: failed to create symbolic link ‘bar’: Cannot allocate memory $ umount /mnt $ btrfsck /dev/sdi Checking filesystem on /dev/sdi UUID: d5acb5ba-31bd-42da-b456-89dca2e716e1 checking extents checking free space cache checking fs roots root 5 inode 258 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, no inode ref found 131073 bytes used err is 1 total csum bytes: 0 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 124305 file data blocks allocated: 262144 referenced 262144 btrfs-progs v4.2.3 So fix this by adding the directory index entries as the very last step of symlink creation. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-31Btrfs: send, don't BUG_ON() when an empty symlink is foundFilipe Manana1-1/+15
When a symlink is successfully created it always has an inline extent containing the source path. However if an error happens when creating the symlink, we can leave in the subvolume's tree a symlink inode without any such inline extent item - this happens if after btrfs_symlink() calls btrfs_end_transaction() and before it calls the inode eviction handler (through the final iput() call), the transaction gets committed and a crash happens before the eviction handler gets called, or if a snapshot of the subvolume is made before the eviction handler gets called. Sadly we can't just avoid this by making btrfs_symlink() call btrfs_end_transaction() after it calls the eviction handler, because the later can commit the current transaction before it removes any items from the subvolume tree (if it encounters ENOSPC errors while reserving space for removing all the items). So make send fail more gracefully, with an -EIO error, and print a message to dmesg/syslog informing that there's an empty symlink inode, so that the user can delete the empty symlink or do something else about it. Reported-by: Stephen R. van den Berg <srb@cuci.nl> Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-30Btrfs: fix race between free space endio workers and space cache writeoutFilipe Manana1-0/+19
While running a stress test I ran into the following trace/transaction abort: [471626.672243] ------------[ cut here ]------------ [471626.673322] WARNING: CPU: 9 PID: 19107 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]() [471626.675492] BTRFS: Transaction aborted (error -2) [471626.676748] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix [471626.688802] CPU: 14 PID: 19107 Comm: fsstress Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [471626.690148] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [471626.691901] 0000000000000000 ffff880016037cf0 ffffffff812566f4 ffff880016037d38 [471626.695009] ffff880016037d28 ffffffff8104d0a6 ffffffffa040c84e 00000000fffffffe [471626.697490] ffff88011fe855f8 ffff88000c484cb0 ffff88000d195000 ffff880016037d90 [471626.699201] Call Trace: [471626.699804] [<ffffffff812566f4>] dump_stack+0x4e/0x79 [471626.701049] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8 [471626.702542] [<ffffffffa040c84e>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs] [471626.704326] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50 [471626.705636] [<ffffffffa0403717>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs] [471626.707048] [<ffffffffa040c84e>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs] [471626.708616] [<ffffffffa048a50a>] commit_cowonly_roots+0x1d7/0x25a [btrfs] [471626.709950] [<ffffffffa041e34a>] btrfs_commit_transaction+0x4c4/0x991 [btrfs] [471626.711286] [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31 [471626.712611] [<ffffffffa03f6df4>] btrfs_sync_fs+0x145/0x1ad [btrfs] [471626.715610] [<ffffffff811962a2>] ? SyS_tee+0x226/0x226 [471626.716718] [<ffffffff811962c2>] sync_fs_one_sb+0x20/0x22 [471626.717672] [<ffffffff8116fc01>] iterate_supers+0x75/0xc2 [471626.718800] [<ffffffff8119669a>] sys_sync+0x52/0x80 [471626.719990] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f [471626.721835] ---[ end trace baf57f43d76693f4 ]--- [471626.722954] BTRFS: error (device sdc) in btrfs_write_dirty_block_groups:3740: errno=-2 No such entry This is a very rare situation and it happened due to a race between a free space endio worker and writing the space caches for dirty block groups at a transaction's commit critical section. The steps leading to this are: 1) A task calls btrfs_commit_transaction() and starts the writeout of the space caches for all currently dirty block groups (i.e. it calls btrfs_start_dirty_block_groups()); 2) The previous step starts writeback for space caches; 3) When the writeback finishes it queues jobs for free space endio work queue (fs_info->endio_freespace_worker) that execute btrfs_finish_ordered_io(); 4) The task committing the transaction sets the transaction's state to TRANS_STATE_COMMIT_DOING and shortly after calls btrfs_write_dirty_block_groups(); 5) A free space endio job joins the transaction, through btrfs_join_transaction_nolock(), and updates a free space inode item in the root tree through btrfs_update_inode_fallback(); 6) Updating the free space inode item resulted in COWing one or more nodes/leaves of the root tree, and that resulted in creating a new metadata block group, which gets added to the transaction's list of dirty block groups (this is a very rare case); 7) The free space endio job has not released yet its transaction handle at this point, so the new metadata block group was not yet fully created (didn't go through btrfs_create_pending_block_groups() yet); 8) The transaction commit task sees the new metadata block group in the transaction's list of dirty block groups and processes it. When it attempts to update the block group's block group item in the extent tree, through write_one_cache_group(), it isn't able to find it and aborts the transaction with error -ENOENT - this is because the free space endio job hasn't yet released its transaction handle (which calls btrfs_create_pending_block_groups()) and therefore the block group item was not yet added to the extent tree. Fix this waiting for free space endio jobs if we fail to find a block group item in the extent tree and then retry once updating the block group item. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-30btrfs: don't run delayed references while we are creating the free space treeChris Mason4-16/+28
This is a short term solution to make sure btrfs_run_delayed_refs() doesn't change the extent tree while we are scanning it to create the free space tree. Longer term we need to synchronize scanning the block groups one by one, similar to what happens during a balance. Signed-off-by: Chris Mason <clm@fb.com>
2015-12-30btrfs: fix compiling with CONFIG_BTRFS_DEBUG enabled.Chris Mason1-0/+2
Merging in the free space tree deleted a variable needed when CONFIG_BTRFS_DEBUG=y Signed-off-by: Chris Mason <clm@fb.com>
2015-12-23btrfs: fix warning on uninit variable in btrfs_finish_chunk_allocChris Mason1-1/+1
map->num_stripes really can't be zero, but just in case. Signed-off-by: Chris Mason <clm@fb.com>
2015-12-23Merge branch 'freespace-4.5' into for-linus-4.5Chris Mason16-113/+2934
2015-12-23Merge branch 'for-chris-4.5' of ↵Chris Mason5-56/+151
git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.5
2015-12-23Merge branch 'dev/simplify-set-bit' of ↵Chris Mason8-136/+116
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>
2015-12-23Merge branch 'dev/gfp-flags' of ↵Chris Mason4-10/+10
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
2015-12-23Merge branch 'cleanup/misc-simplify' of ↵Chris Mason9-47/+26
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
2015-12-21Btrfs: fix unprotected list operations at btrfs_write_dirty_block_groupsFilipe Manana1-2/+17
We call btrfs_write_dirty_block_groups() in the critical section of a transaction's commit, when no other tasks can join the transaction and add more block groups to the transaction's list of dirty block groups, so we not taking the dirty block groups spinlock when checking for the list's emptyness, grabbing its first element or deleting elements from it. However there's a special and rare case where we can have a concurrent task adding elements to this list. We trigger writeback for space caches before at btrfs_start_dirty_block_groups() and in past iterations of the loop at btrfs_write_dirty_block_groups(), this means that when the writeback finishes (which happens asynchronously) it creates a task for the endio free space work queue that executes btrfs_finish_ordered_io() - this function is able to join the transaction, through btrfs_join_transaction_nolock(), and update the free space cache's inode item in the root tree, which can result in COWing nodes of this tree and therefore allocation of a new block group can happen, which gets added to the transaction's list of dirty block groups while the transaction commit task is operating on it concurrently. So fix this by taking the dirty block groups spinlock before doing operations on the dirty block groups list at btrfs_write_dirty_block_groups(). Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-20Linux 4.4-rc6v4.4-rc6Linus Torvalds1-1/+1
2015-12-20Merge tag 'rtc-4.4-3' of ↵Linus Torvalds2-14/+53
git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux Pull RTC fixes from Alexandre Belloni: "Late fixes for the RTC subsystem for 4.4: A fix for a nasty hardware bug in rk808 and an initialization reordering in da9063 to fix a possible crash" * tag 'rtc-4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: rtc: da9063: fix access ordering error during RTC interrupt at system power on rtc: rk808: Compensate for Rockchip calendar deviation on November 31st
2015-12-20rtc: da9063: fix access ordering error during RTC interrupt at system power onSteve Twiss1-10/+9
This fix alters the ordering of the IRQ and device registrations in the RTC driver probe function. This change will apply to the RTC driver that supports both DA9063 and DA9062 PMICs. A problem could occur with the existing RTC driver if: A system is started from a cold boot using the PMIC RTC IRQ to initiate a power on operation. For instance, if an RTC alarm is used to start a platform from power off. The existing driver IRQ is requested before the device has been properly registered. i.e. ret = devm_request_threaded_irq() comes before rtc->rtc_dev = devm_rtc_device_register(); In this case, the interrupt can be called before the device has been registered and the handler can be called immediately. The IRQ handler da9063_alarm_event() contains the function call rtc_update_irq(rtc->rtc_dev, 1, RTC_IRQF | RTC_AF); which in turn tries to access the unavailable rtc->rtc_dev. The fix is to reorder the functions inside the RTC probe. The IRQ is requested after the RTC device resource has been registered so that get_irq_byname is the last thing to happen. Signed-off-by: Steve Twiss <stwiss.opensource@diasemi.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
2015-12-20rtc: rk808: Compensate for Rockchip calendar deviation on November 31stJulius Werner1-4/+44
In A.D. 1582 Pope Gregory XIII found that the existing Julian calendar insufficiently represented reality, and changed the rules about calculating leap years to account for this. Similarly, in A.D. 2013 Rockchip hardware engineers found that the new Gregorian calendar still contained flaws, and that the month of November should be counted up to 31 days instead. Unfortunately it takes a long time for calendar changes to gain widespread adoption, and just like more than 300 years went by before the last Protestant nation implemented Greg's proposal, we will have to wait a while until all religions and operating system kernels acknowledge the inherent advantages of the Rockchip system. Until then we need to translate dates read from (and written to) Rockchip hardware back to the Gregorian format. This patch works by defining Jan 1st, 2016 as the arbitrary anchor date on which Rockchip and Gregorian calendars are in sync. From that we can translate arbitrary later dates back and forth by counting the number of November/December transitons since the anchor date to determine the offset between the calendars. We choose this method (rather than trying to regularly "correct" the date stored in hardware) since it's the only way to ensure perfect time-keeping even if the system may be shut down for an unknown number of years. The drawback is that other software reading the same hardware (e.g. mainboard firmware) must use the same translation convention (including the same anchor date) to be able to read and write correct timestamps from/to the RTC. Signed-off-by: Julius Werner <jwerner@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
2015-12-19Merge tag 'tty-4.4-rc6' of ↵Linus Torvalds5-17/+19
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial fixes from Greg KH: "Here are some tty/serial driver fixes for 4.4-rc6 that resolve some reported problems. All of these have been in linux-next. The details are in the shortlog" * tag 'tty-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: tty: Fix GPF in flush_to_ldisc() serial: earlycon: Add missing spinlock initialization serial: sh-sci: Fix length of scatterlist n_tty: Fix poll() after buffer-limited eof push read serial: 8250_uniphier: fix dl_read and dl_write functions
2015-12-19Merge tag 'usb-4.4-rc6' of ↵Linus Torvalds11-47/+122
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are some USB and PHY fixes for 4.4-rc6. All of them resolve some reported problems. Full details in the shortlog" * tag 'usb-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: USB: fix invalid memory access in hub_activate() USB: ipaq.c: fix a timeout loop phy: core: Get a refcount to phy in devm_of_phy_get_by_index() phy: cygnus: pcie: add missing of_node_put phy: miphy365x: add missing of_node_put phy: miphy28lp: add missing of_node_put phy: rockchip-usb: add missing of_node_put phy: berlin-sata: add missing of_node_put phy: mt65xx-usb3: add missing of_node_put phy: brcmstb-sata: add missing of_node_put phy: sun9i-usb: add USB dependency
2015-12-19Merge tag 'md/4.4-rc5-fixes' of git://neil.brown.name/mdLinus Torvalds3-10/+24
Pull md fixes from Neil Brown: "Four fixes for md: - two recently introduced regressions fixed. - one older bug in RAID10 - tagged for -stable since 4.2 - one minor sysfs api improvement" * tag 'md/4.4-rc5-fixes' of git://neil.brown.name/md: Fix remove_and_add_spares removes drive added as spare in slot_store md: fix bug due to nested suspend MD: change journal disk role to disk 0 md/raid10: fix data corruption and crash during resync
2015-12-19Merge tag 'powerpc-4.4-5' of ↵Linus Torvalds4-26/+26
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Partial revert of "powerpc: Individual System V IPC system calls" - pr_warn_once on unsupported OPAL_MSG type from Stewart - Fix deadlock in opal-irqchip introduced by "Fix double endian conversion" from Alistair * tag 'powerpc-4.4-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/opal-irqchip: Fix deadlock introduced by "Fix double endian conversion" powerpc/powernv: pr_warn_once on unsupported OPAL_MSG type Partial revert of "powerpc: Individual System V IPC system calls"
2015-12-19Merge tag 'spi-fix-v4.4-rc5' of ↵Linus Torvalds3-8/+8
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi fixes from Mark Brown: "A couple of reference counting bugs here, one in spidev and one with holding an extra reference in the core that we never freed if we removed a device, plus a driver specific fix. Both of the refcounting bugs are very old but they've only been found by observation so hopefully their impact has been low" * tag 'spi-fix-v4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: spi: fix parent-device reference leak spi: spidev: Hold spi_lock over all defererences of spi in release() spi-fsl-dspi: Fix CTAR Register access
2015-12-19Merge tag 'gpio-v4.4-3' of ↵Linus Torvalds3-4/+10
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull GPIO fixes from Linus Walleij: "Some GPIO fixes for the v4.4 series. Most prominent: I revert the error propagation from the .get() function until we can fix up all the drivers properly for v4.5. - Revert the error number propagation from the .get() vtable entry temporarily, until we make the proper fixes to all drivers. - Fix the clamping behaviour in the generic GPIO driver. - Driver fix for the ath79 driver" * tag 'gpio-v4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: gpio: revert get() to non-errorprogating behaviour gpio: generic: clamp values from bgpio_get_set() gpio: ath79: Fix the logic to clear offset bit of AR71XX_GPIO_REG_OE register
2015-12-19Merge tag 'pinctrl-v4.4-3' of ↵Linus Torvalds7-28/+41
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control fixes from Linus Walleij: - Driver fixes for Freescale i.MX7D, Intel, Broadcom 2835 - One MAINTAINERS entry * tag 'pinctrl-v4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: MAINTAINERS: pinctrl: Add maintainers for pinctrl-single pinctrl: bcm2835: Fix initial value for direction_output pinctrl: intel: fix offset calculation issue of register PAD_OWN pinctrl: intel: fix bug of register offset calculation pinctrl: freescale: add ZERO_OFFSET_VALID flag for vf610 pinctrl
2015-12-19Merge branch 'i2c/for-current' of ↵Linus Torvalds9-23/+50
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "A set of 'usual' driver bugfixes for the I2C subsystem" * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: rcar: disable runtime PM correctly in slave mode i2c: designware: Keep pm_runtime_enable/_disable calls in sync i2c: designware: fix IO timeout issue for AMD controller i2c: imx: init bus recovery info before adding i2c adapter i2c: do not use 0x in front of %pa i2c: davinci: Increase module clock frequency i2c: mv64xxx: The n clockdiv factor is 0 based on sunxi SoCs i2c: rk3x: populate correct variable for sda_falling_time
2015-12-19Merge branch 'for-linus' of ↵Linus Torvalds11-12/+65
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input fixes from Dmitry Torokhov: "Just a few assorted driver fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: elants_i2c - fix wake-on-touch Input: elan_i2c - set input device's vendor and product IDs Input: sun4i-lradc-keys - fix typo in binding documentation Input: atmel_mxt_ts - add maxtouch to I2C table for module autoload Input: arizona-haptic - fix disabling of haptics device Input: aiptek - fix crash on detecting device without endpoints Input: atmel_mxt_ts - add generic platform data for Chromebooks Input: parkbd - clear unused function pointers Input: walkera0701 - clear unused function pointers Input: turbografx - clear unused function pointers Input: gamecon - clear unused function pointers Input: db9 - clear unused function pointers
2015-12-19i2c: rcar: disable runtime PM correctly in slave modeWolfram Sang1-2/+2
When we also are I2C slave, we need to disable runtime PM because the address detection mechanism needs to be active all the time. However, we can reenable runtime PM once the slave instance was unregistered. So, use pm_runtime_get_sync/put to achieve this, since it has proper refcounting. pm_runtime_allow/forbid is like a global knob controllable from userspace which is unsuitable here. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Cc: stable@kernel.org
2015-12-18Merge tag 'pm+acpi-4.4-rc6' of ↵Linus Torvalds4-15/+29
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix a potential regression introduced during the 4.3 cycle (generic power domains framework), a nasty bug that has been present forever (power capping RAPL driver), a build issue (Tegra cpufreq driver) and a minor ugliness introduced recently (intel_pstate). Specifics: - Fix a potential regression in the generic power domains framework introduced during the 4.3 development cycle that may lead to spurious failures of system suspend in certain situations (Ulf Hansson). - Fix a problem in the power capping RAPL (Running Average Power Limits) driver that causes it to initialize successfully on some systems where it is not supposed to do that which is due to an incorrect check in an initialization routine (Prarit Bhargava). - Fix a build problem in the cpufreq Tegra driver that depends on the regulator framework, but that dependency is not reflected in Kconfig (Arnd Bergmann). - Fix a recent mistake in the intel_pstate driver where a numeric constant is used directly instead of a symbol defined specifically for the case in question (Prarit Bhargava)" * tag 'pm+acpi-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: powercap / RAPL: fix BIOS lock check cpufreq: intel_pstate: Minor cleanup for FRAC_BITS cpufreq: tegra: add regulator dependency for T124 PM / Domains: Allow runtime PM callbacks to be re-used during system PM
2015-12-18Merge tag 'scsi-fixes' of ↵Linus Torvalds3-12/+42
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Three fixes this time, two in SES picked up by KASAN for various types of buffer overrun. The first is a USB array which returns page 8 whatever is asked for and causes us to overrun with incorrect data format assumptions and the second is an invalid iteration of page 10 (the additional information page). The final fix is a reversion of a NULL deref fix which caused suspend/resume not to be called in pairs leading to incorrect device operation (Jens has queued a more proper fix for the problem in block)" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: ses: fix additional element traversal bug Revert "SCSI: Fix NULL pointer dereference in runtime PM" ses: Fix problems with simple enclosures
2015-12-18Input: elants_i2c - fix wake-on-touchJames Chen1-9/+12
When sending "SLEEP" command to the controller it ceases scanning completely and is unable to wake the system up from sleep, so if it is configured as a wakeup source we should simply configure interrupt for wakeup and rely on idle logic within the controller to reduce power consumption while it is not used. Signed-off-by: James Chen <james.chen@emc.com.tw> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2015-12-18Merge tag 'media/v4.4-3' of ↵Linus Torvalds3-4/+15
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fixes from Mauro Carvalho Chehab. * tag 'media/v4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: [media] airspy: increase USB control message buffer size [media] hackrf: move RF gain ctrl enable behind module parameter [media] hackrf: fix possible null ptr on debug printing [media] Revert "[media] ivtv: avoid going past input/audio array"
2015-12-18Merge branch 'for-linus-4.4' of ↵Linus Torvalds6-15/+29
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "A couple of small fixes" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: check prepare_uptodate_page() error code earlier Btrfs: check for empty bitmap list in setup_cluster_bitmaps btrfs: fix misleading warning when space cache failed to load Btrfs: fix transaction handle leak in balance Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list
2015-12-18Merge branch 'akpm' (patches from Andrew)Linus Torvalds3-3/+5
Merge misc fixes from Andrew Morton: "Three patches" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: include/linux/mmdebug.h: should include linux/bug.h mm/zswap: change incorrect strncmp use to strcmp proc: fix -ESRCH error when writing to /proc/$pid/coredump_filter
2015-12-18include/linux/mmdebug.h: should include linux/bug.hJames Morse1-0/+1
mmdebug.h uses BUILD_BUG_ON_INVALID(), assuming someone else included linux/bug.h. Include it ourselves. This saves build-failures such as: arch/arm64/include/asm/pgtable.h: In function 'set_pte_at': arch/arm64/include/asm/pgtable.h:281:3: error: implicit declaration of function 'BUILD_BUG_ON_INVALID' [-Werror=implicit-function-declaration] VM_WARN_ONCE(!pte_young(pte), Fixes: 02602a18c32d7 ("bug: completely remove code generated by disabled VM_BUG_ON()") Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-18mm/zswap: change incorrect strncmp use to strcmpDan Streetman1-3/+3
Change the use of strncmp in zswap_pool_find_get() to strcmp. The use of strncmp is no longer correct, now that zswap_zpool_type is not an array; sizeof() will return the size of a pointer, which isn't the right length to compare. We don't need to use strncmp anyway, because the existing params and the passed in params are all guaranteed to be null terminated, so strcmp should be used. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Reported-by: Weijie Yang <weijie.yang@samsung.com> Cc: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-18proc: fix -ESRCH error when writing to /proc/$pid/coredump_filterColin Ian King1-0/+1
Writing to /proc/$pid/coredump_filter always returns -ESRCH because commit 774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()") removed the setting of ret after the get_proc_task call and incorrectly left it as -ESRCH. Instead, return 0 when successful. Example breakage: echo 0 > /proc/self/coredump_filter bash: echo: write error: No such process Fixes: 774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-18Merge tag 'hwmon-for-linus-v4.4-rc6' of ↵Linus Torvalds2-1/+16
git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging Pull hwmon fixes from Guenter Roeck: - Select CONFIG_BITREVERSE for sht15 driver to avoid build failure if it is not configured. - Force wait for conversion time for the first valid data in tmp102 driver to avoid reporting erroneous data to the thermal subsystem. * tag 'hwmon-for-linus-v4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: hwmon: (sht15) Select CONFIG_BITREVERSE hwmon: (tmp102) Force wait for conversion time for the first valid data
2015-12-18Merge tag 'iommu-fixes-v4.4-rc5' of ↵Linus Torvalds2-2/+38
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull IOMMU fixes from Joerg Roedel: "Two similar fixes for the Intel and AMD IOMMU drivers to add proper access checks before calling handle_mm_fault" * tag 'iommu-fixes-v4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/vt-d: Do access checks before calling handle_mm_fault() iommu/amd: Do proper access checking before calling handle_mm_fault()
2015-12-18Merge tag 'for-linus-4.4-rc5-tag' of ↵Linus Torvalds11-67/+138
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen bug fixes from David Vrabel: - XSA-155 security fixes to backend drivers. - XSA-157 security fixes to pciback. * tag 'for-linus-4.4-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen-pciback: fix up cleanup path when alloc fails xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set. xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled. xen/pciback: Do not install an IRQ handler for MSI interrupts. xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled xen/pciback: Save xen_pci_op commands before processing it xen-scsiback: safely copy requests xen-blkback: read from indirect descriptors only once xen-blkback: only read request operation from shared ring once xen-netback: use RING_COPY_REQUEST() throughout xen-netback: don't use last request to determine minimum Tx credit xen: Add RING_COPY_REQUEST() xen/x86/pvh: Use HVM's flush_tlb_others op xen: Resume PMU from non-atomic context xen/events/fifo: Consume unprocessed events when a CPU dies
2015-12-18Merge tag 'arc-fixes-for-4.4-rc6' of ↵Linus Torvalds14-68/+97
git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC architecture fixes from Vineet Gupta: "Fixes for: - perf interrupts on SMP: Not enabled (at boot) and disabled (at runtime) - stack unwinder regression (for modules, ignoring dwarf3) - nsim hosed for non default kernel link base builds" * tag 'arc-fixes-for-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: ARC: smp: Rename platform hook @init_cpu_smp -> @init_per_cpu ARC: rename smp operation init_irq_cpu() to init_per_cpu() ARC: dw2 unwind: Ignore CIE version !=1 gracefully instead of bailing ARC: dw2 unwind: Reinstante unwinding out of modules ARC: [plat-sim] unbork non default CONFIG_LINUX_LINK_BASE ARC: intc: Document arc_request_percpu_irq() better ARCv2: perf: Ensure perf intr gets enabled on all cores ARC: intc: No need to clear IRQ_NOAUTOEN ARCv2: intc: Fix random perf irq disabling in SMP setup ARC: [axs10x] cap ethernet phy to 100 Mbit/sec
2015-12-18Merge tag 'sound-4.4-rc6' of ↵Linus Torvalds6-31/+81
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "As usual in rc6, this update contains only a few HD-audio and USB-audio device-specific quirks: yet another Thinkpad noise fixes, Dell headphone mic fixes, and AudioQuest DragonFly fixes" * tag 'sound-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: hda - Add a fixup for Thinkpad X1 Carbon 2nd ALSA: hda - Set codec to D3 at reboot/shutdown on Thinkpads ALSA: hda - Apply click noise workaround for Thinkpads generically ALSA: hda - Fix headphone mic input on a few Dell ALC293 machines ALSA: usb-audio: Add sample rate inquiry quirk for AudioQuest DragonFly ALSA: usb-audio: Add a more accurate volume quirk for AudioQuest DragonFly
2015-12-18Merge tag 'for-linus-20151217' of git://git.infradead.org/linux-mtdLinus Torvalds2-3/+16
Pull MTD fixes from Brian Norris: "I was holding out on this pull request for a bit, since there are a few other small issues being discussed that look like 4.4-rc regressions. Hopefully I can get those stabilized soon, but these are ready at any rate: - A little bit of a last-minute change for the device tree "fixed partition" binding. This is needed because we might want to reuse the 'partitions' subnode for other sorts of partitioning descriptions -- e.g., for describing which on-flash partition format(s) might be used on the system. - Also tone down a warning message, since it is probably going to show up on a lot of systems where it should just be ignored" * tag 'for-linus-20151217' of git://git.infradead.org/linux-mtd: doc: dt: mtd: partitions: add compatible property to "partitions" node mtd: ofpart: don't complain about missing 'partitions' node too loudly
2015-12-18Merge branch 'freespace-tree' into for-linus-4.5Chris Mason16-113/+2934
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-18USB: fix invalid memory access in hub_activate()Alan Stern1-3/+19
Commit 8520f38099cc ("USB: change hub initialization sleeps to delayed_work") changed the hub_activate() routine to make part of it run in a workqueue. However, the commit failed to take a reference to the usb_hub structure or to lock the hub interface while doing so. As a result, if a hub is plugged in and quickly unplugged before the work routine can run, the routine will try to access memory that has been deallocated. Or, if the hub is unplugged while the routine is running, the memory may be deallocated while it is in active use. This patch fixes the problem by taking a reference to the usb_hub at the start of hub_activate() and releasing it at the end (when the work is finished), and by locking the hub interface while the work routine is running. It also adds a check at the start of the routine to see if the hub has already been disconnected, in which nothing should be done. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Alexandru Cornea <alexandru.cornea@intel.com> Tested-by: Alexandru Cornea <alexandru.cornea@intel.com> Fixes: 8520f38099cc ("USB: change hub initialization sleeps to delayed_work") CC: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-18USB: ipaq.c: fix a timeout loopDan Carpenter1-1/+2
The code expects the loop to end with "retries" set to zero but, because it is a post-op, it will end set to -1. I have fixed this by moving the decrement inside the loop. Fixes: 014aa2a3c32e ('USB: ipaq: minor ipaq_open() cleanup.') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-18[media] airspy: increase USB control message buffer sizeAntti Palosaari1-1/+1
Driver requested device firmware version string during probe using only 24 byte long buffer. That buffer is too small for newer firmware versions, which causes device firmware hang - device stops responding to any commands after that. Increase buffer size to 128 which should be enough for any current and future version strings. Link: https://github.com/airspy/host/issues/27 Cc: <stable@vger.kernel.org> # 3.17+ Reported-by: Benjamin Vernoux <bvernoux@gmail.com> Signed-off-by: Antti Palosaari <crope@iki.fi> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
2015-12-18[media] hackrf: move RF gain ctrl enable behind module parameterAntti Palosaari1-0/+11
Used Avago MGA-81563 RF amplifier could be destroyed pretty easily with too strong signal or transmitting to bad antenna. Add module parameter 'enable_rf_gain_ctrl' which allows enabling RF gain control - otherwise, default without the module parameter, RF gain control is set to 'grabbed' state which prevents setting value to the control. Signed-off-by: Antti Palosaari <crope@iki.fi> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
2015-12-18[media] hackrf: fix possible null ptr on debug printingAntti Palosaari1-1/+1
drivers/media/usb/hackrf/hackrf.c:1533 hackrf_probe() error: we previously assumed 'dev' could be null (see line 1366) Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Antti Palosaari <crope@iki.fi> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
2015-12-18[media] Revert "[media] ivtv: avoid going past input/audio array"Mauro Carvalho Chehab1-2/+2
This patch broke ivtv logic, as reported at https://bugzilla.redhat.com/show_bug.cgi?id=1278942 This reverts commit 09290cc885937cab3b2d60a6d48fe3d2d3e04061. Cc: stable@vger.kernel.org # for v4.1 and upper Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
2015-12-18Merge tag 'phy-for-4.4-rc' of ↵Greg Kroah-Hartman9-43/+101
git://git.kernel.org/pub/scm/linux/kernel/git/kishon/linux-phy into usb-linus Kishon writes: phy: for 4.4 -rc *) Add missing of_node_put in a bunch of PHY drivers *) Add get_device in devm_of_phy_get_by_index() *) Fix randconfig build error in sun9i usb driver
2015-12-18xen-pciback: fix up cleanup path when alloc failsDoug Goldstein1-1/+3
When allocating a pciback device fails, clear the private field. This could lead to an use-after free, however the 'really_probe' takes care of setting dev_set_drvdata(dev, NULL) in its failure path (which we would exercise if the ->probe function failed), so we we are OK. However lets be defensive as the code can change. Going forward we should clean up the pci_set_drvdata(dev, NULL) in the various code-base. That will be for another day. Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reported-by: Jonathan Creekmore <jonathan.creekmore@gmail.com> Signed-off-by: Doug Goldstein <cardoe@cardoe.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2015-12-18hwmon: (sht15) Select CONFIG_BITREVERSEArnd Bergmann1-0/+1
If CONFIG_BITREVERSE is not built-in, the sht15 driver fails to link: drivers/built-in.o: In function `sht15_crc8': drivers/hwmon/sht15.c:195: undefined reference to `byte_rev_table' This adds a Kconfig 'select' statement, like all other users of bitrev.h have it. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 33836ee98533 ("hwmon:change sht15_reverse()") Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2015-12-18xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.Konrad Rzeszutek Wilk1-1/+7
commit f598282f51 ("PCI: Fix the NIU MSI-X problem in a better way") teaches us that dealing with MSI-X can be troublesome. Further checks in the MSI-X architecture shows that if the PCI_COMMAND_MEMORY bit is turned of in the PCI_COMMAND we may not be able to access the BAR (since they are memory regions). Since the MSI-X tables are located in there.. that can lead to us causing PCIe errors. Inhibit us performing any operation on the MSI-X unless the MEMORY bit is set. Note that Xen hypervisor with: "x86/MSI-X: access MSI-X table only after having enabled MSI-X" will return: xen_pciback: 0000:0a:00.1: error -6 enabling MSI-X for guest 3! When the generic MSI code tries to setup the PIRQ without MEMORY bit set. Which means with later versions of Xen (4.6) this patch is not neccessary. This is part of XSA-157 CC: stable@vger.kernel.org Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2015-12-18xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has ↵Konrad Rzeszutek Wilk1-13/+20
MSI(X) enabled. Otherwise just continue on, returning the same values as previously (return of 0, and op->result has the PIRQ value). This does not change the behavior of XEN_PCI_OP_disable_msi[|x]. The pci_disable_msi or pci_disable_msix have the checks for msi_enabled or msix_enabled so they will error out immediately. However the guest can still call these operations and cause us to disable the 'ack_intr'. That means the backend IRQ handler for the legacy interrupt will not respond to interrupts anymore. This will lead to (if the device is causing an interrupt storm) for the Linux generic code to disable the interrupt line. Naturally this will only happen if the device in question is plugged in on the motherboard on shared level interrupt GSI. This is part of XSA-157 CC: stable@vger.kernel.org Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2015-12-18xen/pciback: Do not install an IRQ handler for MSI interrupts.Konrad Rzeszutek Wilk1-0/+7
Otherwise an guest can subvert the generic MSI code to trigger an BUG_ON condition during MSI interrupt freeing: for (i = 0; i < entry->nvec_used; i++) BUG_ON(irq_has_action(entry->irq + i)); Xen PCI backed installs an IRQ handler (request_irq) for the dev->irq whenever the guest writes PCI_COMMAND_MEMORY (or PCI_COMMAND_IO) to the PCI_COMMAND register. This is done in case the device has legacy interrupts the GSI line is shared by the backend devices. To subvert the backend the guest needs to make the backend to change the dev->irq from the GSI to the MSI interrupt line, make the backend allocate an interrupt handler, and then command the backend to free the MSI interrupt and hit the BUG_ON. Since the backend only calls 'request_irq' when the guest writes to the PCI_COMMAND register the guest needs to call XEN_PCI_OP_enable_msi before any other operation. This will cause the generic MSI code to setup an MSI entry and populate dev->irq with the new PIRQ value. Then the guest can write to PCI_COMMAND PCI_COMMAND_MEMORY and cause the backend to setup an IRQ handler for dev->irq (which instead of the GSI value has the MSI pirq). See 'xen_pcibk_control_isr'. Then the guest disables the MSI: XEN_PCI_OP_disable_msi which ends up triggering the BUG_ON condition in 'free_msi_irqs' as there is an IRQ handler for the entry->irq (dev->irq). Note that this cannot be done using MSI-X as the generic code does not over-write dev->irq with the MSI-X PIRQ values. The patch inhibits setting up the IRQ handler if MSI or MSI-X (for symmetry reasons) code had been called successfully. P.S. Xen PCIBack when it sets up the device for the guest consumption ends up writting 0 to the PCI_COMMAND (see xen_pcibk_reset_device). XSA-120 addendum patch removed that - however when upstreaming said addendum we found that it caused issues with qemu upstream. That has now been fixed in qemu upstream. This is part of XSA-157 CC: stable@vger.kernel.org Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2015-12-18xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or ↵Konrad Rzeszutek Wilk1-0/+7
MSI-X enabled The guest sequence of: a) XEN_PCI_OP_enable_msix b) XEN_PCI_OP_enable_msix results in hitting an NULL pointer due to using freed pointers. The device passed in the guest MUST have MSI-X capability. The a) constructs and SysFS representation of MSI and MSI groups. The b) adds a second set of them but adding in to SysFS fails (duplicate entry). 'populate_msi_sysfs' frees the newly allocated msi_irq_groups (note that in a) pdev->msi_irq_groups is still set) and also free's ALL of the MSI-X entries of the device (the ones allocated in step a) and b)). The unwind code: 'free_msi_irqs' deletes all the entries and tries to delete the pdev->msi_irq_groups (which hasn't been set to NULL). However the pointers in the SysFS are already freed and we hit an NULL pointer further on when 'strlen' is attempted on a freed pointer. The patch adds a simple check in the XEN_PCI_OP_enable_msix to guard against that. The check for msi_enabled is not stricly neccessary. This is part of XSA-157 CC: stable@vger.kernel.org Reviewed-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2015-12-18xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or ↵Konrad Rzeszutek Wilk1-1/+6
MSI-X enabled The guest sequence of: a) XEN_PCI_OP_enable_msi b) XEN_PCI_OP_enable_msi c) XEN_PCI_OP_disable_msi results in hitting an BUG_ON condition in the msi.c code. The MSI code uses an dev->msi_list to which it adds MSI entries. Under the above conditions an BUG_ON() can be hit. The device passed in the guest MUST have MSI capability. The a) adds the entry to the dev->msi_list and sets msi_enabled. The b) adds a second entry but adding in to SysFS fails (duplicate entry) and deletes all of the entries from msi_list and returns (with msi_enabled is still set). c) pci_disable_msi passes the msi_enabled checks and hits: BUG_ON(list_empty(dev_to_msi_list(&dev->dev))); and blows up. The patch adds a simple check in the XEN_PCI_OP_enable_msi to guard against that. The check for msix_enabled is not stricly neccessary. This is part of XSA-157. CC: stable@vger.kernel.org Reviewed-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2015-12-18xen/pciback: Save xen_pci_op commands before processing itKonrad Rzeszutek Wilk2-1/+15
Double fetch vulnerabilities that happen when a variable is fetched twice from shared memory but a security check is only performed the first time. The xen_pcibk_do_op function performs a switch statements on the op->cmd value which is stored in shared memory. Interestingly this can result in a double fetch vulnerability depending on the performed compiler optimization. This patch fixes it by saving the xen_pci_op command before processing it. We also use 'barrier' to make sure that the compiler does not perform any optimization. This is part of XSA155. CC: stable@vger.kernel.org Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2015-12-18xen-scsiback: safely copy requestsDavid Vrabel1-1/+1
The copy of the ring request was lacking a following barrier(), potentially allowing the compiler to optimize the copy away. Use RING_COPY_REQUEST() to ensure the request is copied to local memory. This is part of XSA155. CC: stable@vger.kernel.org Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2015-12-18xen-blkback: read from indirect descriptors only onceRoger Pau Monné1-5/+10
Since indirect descriptors are in memory shared with the frontend, the frontend could alter the first_sect and last_sect values after they have been validated but before they are recorded in the request. This may result in I/O requests that overflow the foreign page, possibly overwriting local pages when the I/O request is executed. When parsing indirect descriptors, only read first_sect and last_sect once. This is part of XSA155. CC: stable@vger.kernel.org Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2015-12-18xen-blkback: only read request operation from shared ring onceRoger Pau Monné1-4/+4
A compiler may load a switch statement value multiple times, which could be bad when the value is in memory shared with the frontend. When converting a non-native request to a native one, ensure that src->operation is only loaded once by using READ_ONCE(). This is part of XSA155. CC: stable@vger.kernel.org Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2015-12-18xen-netback: use RING_COPY_REQUEST() throughoutDavid Vrabel1-16/+14
Instead of open-coding memcpy()s and directly accessing Tx and Rx requests, use the new RING_COPY_REQUEST() that ensures the local copy is correct. This is more than is strictly necessary for guest Rx requests since only the id and gref fields are used and it is harmless if the frontend modifies these. This is part of XSA155. CC: stable@vger.kernel.org Reviewed-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2015-12-18xen-netback: don't use last request to determine minimum Tx creditDavid Vrabel1-3/+1
The last from guest transmitted request gives no indication about the minimum amount of credit that the guest might need to send a packet since the last packet might have been a small one. Instead allow for the worst case 128 KiB packet. This is part of XSA155. CC: stable@vger.kernel.org Reviewed-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2015-12-18xen: Add RING_COPY_REQUEST()David Vrabel1-0/+14
Using RING_GET_REQUEST() on a shared ring is easy to use incorrectly (i.e., by not considering that the other end may alter the data in the shared ring while it is being inspected). Safe usage of a request generally requires taking a local copy. Provide a RING_COPY_REQUEST() macro to use instead of RING_GET_REQUEST() and an open-coded memcpy(). This takes care of ensuring that the copy is done correctly regardless of any possible compiler optimizations. Use a volatile source to prevent the compiler from reordering or omitting the copy. This is part of XSA155. CC: stable@vger.kernel.org Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2015-12-18powerpc/opal-irqchip: Fix deadlock introduced by "Fix double endian conversion"Alistair Popple1-1/+13
Commit 25642e1459ac ("powerpc/opal-irqchip: Fix double endian conversion") fixed an endian bug by calling opal_handle_events() in opal_event_unmask(). However this introduced a deadlock if we find an event is active during unmasking and call opal_handle_events() again. The bad call sequence is: opal_interrupt() -> opal_handle_events() -> generic_handle_irq() -> handle_level_irq() -> raw_spin_lock(&desc->lock) handle_irq_event(desc) unmask_irq(desc) -> opal_event_unmask() -> opal_handle_events() -> generic_handle_irq() -> handle_level_irq() -> raw_spin_lock(&desc->lock) (BOOM) When generating multiple opal events in quick succession this would lead to the following stall warnings: EEH: Fenced PHB#0 detected, location: U78C9.001.WZS09XA-P1-C32 INFO: rcu_sched detected stalls on CPUs/tasks: 12-...: (1 GPs behind) idle=68f/140000000000001/0 softirq=860/861 fqs=2065 15-...: (1 GPs behind) idle=be5/140000000000001/0 softirq=1142/1143 fqs=2065 (detected by 13, t=2102 jiffies, g=1325, c=1324, q=602) NMI watchdog: BUG: soft lockup - CPU#18 stuck for 22s! [irqbalance:2696] INFO: rcu_sched detected stalls on CPUs/tasks: 12-...: (1 GPs behind) idle=68f/140000000000001/0 softirq=860/861 fqs=8371 15-...: (1 GPs behind) idle=be5/140000000000001/0 softirq=1142/1143 fqs=8371 (detected by 20, t=8407 jiffies, g=1325, c=1324, q=1290) This patch corrects the problem by queuing the work if an event is active during unmasking, which is similar to the pre-endian fix behaviour. Fixes: 25642e1459ac ("powerpc/opal-irqchip: Fix double endian conversion") Signed-off-by: Alistair Popple <alistair@popple.id.au> Reported-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-18Fix remove_and_add_spares removes drive added as spare in slot_storeGoldwyn Rodrigues1-3/+10
Commit 2910ff17d154baa5eb50e362a91104e831eb2bb6 introduced a regression which would remove a recently added spare via slot_store. Revert part of the patch which touches slot_store() and add the disk directly using pers->hot_add_disk() Fixes: 2910ff17d154 ("md: remove_and_add_spares() to activate specific rdev") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-18md: fix bug due to nested suspendMikulas Patocka1-3/+4
The patch c7bfced9a6716ff66c9d61f934bb60af08d4688c committed to 4.4-rc causes crash in LVM test shell/lvchange-raid.sh. The kernel crashes with this BUG, the reason is that we attempt to suspend a device that is already suspended. See also https://bugzilla.redhat.com/show_bug.cgi?id=1283491 This patch fixes the bug by changing functions mddev_suspend and mddev_resume to always nest. The number of nested calls to mddev_nested_suspend is kept in the variable mddev->suspended. [neilb: made mddev_suspend() always nest instead of introduce mddev_nested_suspend] kernel BUG at drivers/md/md.c:317! CPU: 3 PID: 32754 Comm: lvm Not tainted 4.4.0-rc2 #1 task: 0000000047076040 ti: 0000000047014000 task.ti: 0000000047014000 YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI PSW: 00001000000001000000000000001111 Not tainted r00-03 000000000804000f 00000000102c5280 0000000010c7522c 000000007e3d1810 r04-07 0000000010c6f000 000000004ef37f20 000000007e3d1dd0 000000007e3d1810 r08-11 000000007c9f1600 0000000000000000 0000000000000001 ffffffffffffffff r12-15 0000000010c1d000 0000000000000041 00000000f98d63c8 00000000f98e49e4 r16-19 00000000f98e49e4 00000000c138fd06 00000000f98d63c8 0000000000000001 r20-23 0000000000000002 000000004ef37f00 00000000000000b0 00000000000001d1 r24-27 00000000424783a0 000000007e3d1dd0 000000007e3d1810 00000000102b2000 r28-31 0000000000000001 0000000047014840 0000000047014930 0000000000000001 sr00-03 0000000007040800 0000000000000000 0000000000000000 0000000007040800 sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000102c538c 00000000102c5390 IIR: 03ffe01f ISR: 0000000000000000 IOR: 00000000102b2748 CPU: 3 CR30: 0000000047014000 CR31: 0000000000000000 ORIG_R28: 00000000000000b0 IAOQ[0]: mddev_suspend+0x10c/0x160 [md_mod] IAOQ[1]: mddev_suspend+0x110/0x160 [md_mod] RP(r2): raid1_add_disk+0xd4/0x2c0 [raid1] Backtrace: [<0000000010c7522c>] raid1_add_disk+0xd4/0x2c0 [raid1] [<0000000010c20078>] raid_resume+0x390/0x418 [dm_raid] [<00000000105833e8>] dm_table_resume_targets+0xc0/0x188 [dm_mod] [<000000001057f784>] dm_resume+0x144/0x1e0 [dm_mod] [<0000000010587dd4>] dev_suspend+0x1e4/0x568 [dm_mod] [<0000000010589278>] ctl_ioctl+0x1e8/0x428 [dm_mod] [<0000000010589518>] dm_compat_ctl_ioctl+0x18/0x68 [dm_mod] [<0000000040377b88>] compat_SyS_ioctl+0xd0/0x1558 Fixes: c7bfced9a671 ("md: suspend i/o during runtime blk_integrity_unregister") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-18MD: change journal disk role to disk 0Shaohua Li2-3/+7
Neil pointed out setting journal disk role to raid_disks will confuse reshape if we support reshape eventually. Switching the role to 0 (we should be fine as long as the value >=0) and skip sysfs file creation to avoid error. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-18md/raid10: fix data corruption and crash during resyncArtur Paszkiewicz1-1/+3
The commit c31df25f20e3 ("md/raid10: make sync_request_write() call bio_copy_data()") replaced manual data copying with bio_copy_data() but it doesn't work as intended. The source bio (fbio) is already processed, so its bvec_iter has bi_size == 0 and bi_idx == bi_vcnt. Because of this, bio_copy_data() either does not copy anything, or worse, copies data from the ->bi_next bio if it is set. This causes wrong data to be written to drives during resync and sometimes lockups/crashes in bio_copy_data(): [ 517.338478] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md126_raid10:3319] [ 517.347324] Modules linked in: raid10 xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul cryptd shpchp pcspkr ipmi_si ipmi_msghandler tpm_crb acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom sd_mod e1000e ax88179_178a usbnet mii ahci ata_generic crc32c_intel libahci ptp pata_acpi libata pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod [ 517.440555] CPU: 0 PID: 3319 Comm: md126_raid10 Not tainted 4.3.0-rc6+ #1 [ 517.448384] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0055.D14.1509221924 09/22/2015 [ 517.459768] task: ffff880153773980 ti: ffff880150df8000 task.ti: ffff880150df8000 [ 517.468529] RIP: 0010:[<ffffffff812e1888>] [<ffffffff812e1888>] bio_copy_data+0xc8/0x3c0 [ 517.478164] RSP: 0018:ffff880150dfbc98 EFLAGS: 00000246 [ 517.484341] RAX: ffff880169356688 RBX: 0000000000001000 RCX: 0000000000000000 [ 517.492558] RDX: 0000000000000000 RSI: ffffea0001ac2980 RDI: ffffea0000d835c0 [ 517.500773] RBP: ffff880150dfbd08 R08: 0000000000000001 R09: ffff880153773980 [ 517.508987] R10: ffff880169356600 R11: 0000000000001000 R12: 0000000000010000 [ 517.517199] R13: 000000000000e000 R14: 0000000000000000 R15: 0000000000001000 [ 517.525412] FS: 0000000000000000(0000) GS:ffff880174a00000(0000) knlGS:0000000000000000 [ 517.534844] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 517.541507] CR2: 00007f8a044d5fed CR3: 0000000169504000 CR4: 00000000001406f0 [ 517.549722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 517.557929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 517.566144] Stack: [ 517.568626] ffff880174a16bc0 ffff880153773980 ffff880169356600 0000000000000000 [ 517.577659] 0000000000000001 0000000000000001 ffff880153773980 ffff88016a61a800 [ 517.586715] ffff880150dfbcf8 0000000000000001 ffff88016dd209e0 0000000000001000 [ 517.595773] Call Trace: [ 517.598747] [<ffffffffa043ef95>] raid10d+0xfc5/0x1690 [raid10] [ 517.605610] [<ffffffff816697ae>] ? __schedule+0x29e/0x8e2 [ 517.611987] [<ffffffff814ff206>] md_thread+0x106/0x140 [ 517.618072] [<ffffffff810c1d80>] ? wait_woken+0x80/0x80 [ 517.624252] [<ffffffff814ff100>] ? super_1_load+0x520/0x520 [ 517.630817] [<ffffffff8109ef89>] kthread+0xc9/0xe0 [ 517.636506] [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70 [ 517.643653] [<ffffffff8166d99f>] ret_from_fork+0x3f/0x70 [ 517.649929] [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70 Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Reviewed-by: Shaohua Li <shli@kernel.org> Cc: stable@vger.kernel.org (v4.2+) Fixes: c31df25f20e3 ("md/raid10: make sync_request_write() call bio_copy_data()") Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-18Btrfs: fix locking bugs when defragging leavesFilipe Manana1-3/+24
When running fstests btrfs/070, with a higher number of fsstress operations, I ran frequently into two different locking bugs when defragging directories. The first bug produced the following traces: [133860.229792] ------------[ cut here ]------------ [133860.251062] WARNING: CPU: 2 PID: 26057 at fs/btrfs/locking.c:46 btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs]() [133860.253576] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 psmouse parport [133860.282566] CPU: 2 PID: 26057 Comm: btrfs Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [133860.284393] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [133860.286827] 0000000000000000 ffff880207697b78 ffffffff812566f4 0000000000000000 [133860.288341] ffff880207697bb0 ffffffff8104d0a6 ffffffffa052d4c1 ffff880178f60e00 [133860.294219] ffff880178f60e00 0000000000000000 00000000000000f6 ffff880207697bc0 [133860.295831] Call Trace: [133860.306518] [<ffffffff812566f4>] dump_stack+0x4e/0x79 [133860.307473] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8 [133860.308619] [<ffffffffa052d4c1>] ? btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs] [133860.310068] [<ffffffff8104d172>] warn_slowpath_null+0x1a/0x1c [133860.312552] [<ffffffffa052d4c1>] btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs] [133860.314630] [<ffffffffa04d5787>] btrfs_set_lock_blocking+0xe/0x10 [btrfs] [133860.323596] [<ffffffffa04d99cb>] btrfs_realloc_node+0xb3/0x341 [btrfs] [133860.325233] [<ffffffffa050e396>] btrfs_defrag_leaves+0x239/0x2fa [btrfs] [133860.332427] [<ffffffffa04fc2ce>] btrfs_defrag_root+0x63/0xca [btrfs] [133860.337259] [<ffffffffa052a34e>] btrfs_ioctl_defrag+0x78/0x14e [btrfs] [133860.340147] [<ffffffffa052b00b>] btrfs_ioctl+0x746/0x24c6 [btrfs] [133860.344833] [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc [133860.346343] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [133860.353248] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [133860.354242] [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7 [133860.355232] [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174 [133860.356237] [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6 [133860.358587] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e [133860.360195] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71 [133860.361380] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79 [133860.363578] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f [133860.366217] ---[ end trace 2cadb2f653437e49 ]--- [133860.367399] ------------[ cut here ]------------ [133860.368162] kernel BUG at fs/btrfs/locking.c:307! [133860.369430] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [133860.370205] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 psmouse parport [133860.370205] CPU: 2 PID: 26057 Comm: btrfs Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [133860.370205] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [133860.370205] task: ffff8800aec6db40 ti: ffff880207694000 task.ti: ffff880207694000 [133860.370205] RIP: 0010:[<ffffffffa052d466>] [<ffffffffa052d466>] btrfs_assert_tree_locked+0x10/0x14 [btrfs] [133860.370205] RSP: 0018:ffff880207697bc0 EFLAGS: 00010246 [133860.370205] RAX: 0000000000000000 RBX: ffff880178f60e00 RCX: 0000000000000000 [133860.370205] RDX: ffff88023ec4fb50 RSI: 00000000ffffffff RDI: ffff880178f60e00 [133860.370205] RBP: ffff880207697bc0 R08: 0000000000000001 R09: 0000000000000000 [133860.370205] R10: 0000160000000000 R11: ffffffff81651000 R12: ffff880178f60e00 [133860.370205] R13: 0000000000000000 R14: 00000000000000f6 R15: ffff8801ff409000 [133860.370205] FS: 00007f763efd48c0(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000 [133860.370205] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [133860.370205] CR2: 0000000002158048 CR3: 000000003fd6c000 CR4: 00000000000006e0 [133860.370205] Stack: [133860.370205] ffff880207697bd8 ffffffffa052d4d0 0000000000000000 ffff880207697be8 [133860.370205] ffffffffa04d5787 ffff880207697c80 ffffffffa04d99cb ffff8801ff409590 [133860.370205] ffff880207697ca8 000000f507697c80 ffff880183c11bb8 0000000000000000 [133860.370205] Call Trace: [133860.370205] [<ffffffffa052d4d0>] btrfs_set_lock_blocking_rw+0x66/0xbd [btrfs] [133860.370205] [<ffffffffa04d5787>] btrfs_set_lock_blocking+0xe/0x10 [btrfs] [133860.370205] [<ffffffffa04d99cb>] btrfs_realloc_node+0xb3/0x341 [btrfs] [133860.370205] [<ffffffffa050e396>] btrfs_defrag_leaves+0x239/0x2fa [btrfs] [133860.370205] [<ffffffffa04fc2ce>] btrfs_defrag_root+0x63/0xca [btrfs] [133860.370205] [<ffffffffa052a34e>] btrfs_ioctl_defrag+0x78/0x14e [btrfs] [133860.370205] [<ffffffffa052b00b>] btrfs_ioctl+0x746/0x24c6 [btrfs] [133860.370205] [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc [133860.370205] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [133860.370205] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [133860.370205] [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7 [133860.370205] [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174 [133860.370205] [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6 [133860.370205] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e [133860.370205] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71 [133860.370205] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79 [133860.370205] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f This bug happened because we assumed that by setting keep_locks to 1 in our search path, our path after a call to btrfs_search_slot() would have all nodes locked, which is not always true because unlock_up() (called by btrfs_search_slot()) will unlock a node in a path if the slot of the node below it doesn't point to the last item or beyond the last item. For example, when the tree has a heigth of 2 and path->slots[0] has a value smaller than btrfs_header_nritems(path->nodes[0]) - 1, the node at level 2 will be unlocked (also because lowest_unlock is set to 1 due to the fact that the value passed as ins_len to btrfs_search_slot is 0). This resulted in btrfs_find_next_key(), called before btrfs_realloc_node(), to release out path and call again btrfs_search_slot(), but this time with the cow parameter set to 0, meaning the resulting path got only read locks. Therefore when we called btrfs_realloc_node(), with path->nodes[1] having a read lock, it resulted in the warning and BUG_ON when calling btrfs_set_lock_blocking() against the node, as that function expects the node to have a write lock. The second bug happened often when the first bug didn't happen, and made us hang and hitting the following warning at fs/btrfs/locking.c: 251 void btrfs_tree_lock(struct extent_buffer *eb) 252 { 253 WARN_ON(eb->lock_owner == current->pid); This happened because the tree search we made at btrfs_defrag_leaves() before calling btrfs_find_next_key() locked a leaf and all the other nodes in the path, so btrfs_find_next_key() had no need to release the path and make a new search (with path->lowest_level set to 1). This made btrfs_realloc_node() attempt to write lock the same leaf again, resulting in a hang/deadlock. So fix these issues by calling btrfs_find_next_key() after calling btrfs_realloc_node() and setting the search path's lowest_level to 1 to avoid the hang/deadlock when attempting to write lock the leaves at btrfs_realloc_node(). Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds115-671/+1045
Pull networking fixes from David Miller: 1) Fix uninitialized variable warnings in nfnetlink_queue, a lot of people reported this... From Arnd Bergmann. 2) Don't init mutex twice in i40e driver, from Jesse Brandeburg. 3) Fix spurious EBUSY in rhashtable, from Herbert Xu. 4) Missing DMA unmaps in mvpp2 driver, from Marcin Wojtas. 5) Fix race with work structure access in pppoe driver causing corruptions, from Guillaume Nault. 6) Fix OOPS due to sh_eth_rx() not checking whether netdev_alloc_skb() actually succeeded or not, from Sergei Shtylyov. 7) Don't lose flags when settifn IFA_F_OPTIMISTIC in ipv6 code, from Bjørn Mork. 8) VXLAN_HD_RCO defined incorrectly, fix from Jiri Benc. 9) Fix clock source used for cookies in SCTP, from Marcelo Ricardo Leitner. 10) aurora driver needs HAS_DMA dependency, from Geert Uytterhoeven. 11) ndo_fill_metadata_dst op of vxlan has to handle ipv6 tunneling properly as well, from Jiri Benc. 12) Handle request sockets properly in xfrm layer, from Eric Dumazet. 13) Double stats update in ipv6 geneve transmit path, fix from Pravin B Shelar. 14) sk->sk_policy[] needs RCU protection, and as a result xfrm_policy_destroy() needs to free policies using an RCU grace period, from Eric Dumazet. 15) SCTP needs to clone ipv6 tx options in order to avoid use after free, from Eric Dumazet. 16) Missing kbuild export if ila.h, from Stephen Hemminger. 17) Missing mdiobus_alloc() return value checking in mdio-mux.c, from Tobias Klauser. 18) Validate protocol value range in ->create() methods, from Hannes Frederic Sowa. 19) Fix early socket demux races that result in illegal dst reuse, from Eric Dumazet. 20) Validate socket address length in pptp code, from WANG Cong. 21) skb_reorder_vlan_header() uses incorrect offset and can corrupt packets, from Vlad Yasevich. 22) Fix memory leaks in nl80211 registry code, from Ola Olsson. 23) Timeout loop count handing fixes in mISDN, xgbe, qlge, sfc, and qlcnic. From Dan Carpenter. 24) msg.msg_iocb needs to be cleared in recvfrom() otherwise, for example, AF_ALG will interpret it as an async call. From Tadeusz Struk. 25) inetpeer_set_addr_v4 forgets to initialize the 'vif' field, from Eric Dumazet. 26) rhashtable enforces the minimum table size not early enough, breaking how we calculate the per-cpu lock allocations. From Herbert Xu. 27) Fix FCC port lockup in 82xx driver, from Martin Roth. 28) FOU sockets need to be freed using RCU, from Hannes Frederic Sowa. 29) Fix out-of-bounds access in __skb_complete_tx_timestamp() and sock_setsockopt() wrt. timestamp handling. From WANG Cong. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (117 commits) net: check both type and procotol for tcp sockets drivers: net: xgene: fix Tx flow control tcp: restore fastopen with no data in SYN packet af_unix: Revert 'lock_interruptible' in stream receive code fou: clean up socket with kfree_rcu 82xx: FCC: Fixing a bug causing to FCC port lock-up gianfar: Don't enable RX Filer if not supported net: fix warnings in 'make htmldocs' by moving macro definition out of field declaration rhashtable: Fix walker list corruption rhashtable: Enforce minimum size on initial hash table inet: tcp: fix inetpeer_set_addr_v4() ipv6: automatically enable stable privacy mode if stable_secret set net: fix uninitialized variable issue bluetooth: Validate socket address length in sco_sock_bind(). net_sched: make qdisc_tree_decrease_qlen() work for non mq ser_gigaset: remove unnecessary kfree() calls from release method ser_gigaset: fix deallocation of platform device structure ser_gigaset: turn nonsense checks into WARN_ON ser_gigaset: fix up NULL checks qlcnic: fix a timeout loop ...
2015-12-17net: check both type and procotol for tcp socketsWANG Cong2-2/+4
Dmitry reported the following out-of-bound access: Call Trace: [<ffffffff816cec2e>] __asan_report_load4_noabort+0x3e/0x40 mm/kasan/report.c:294 [<ffffffff84affb14>] sock_setsockopt+0x1284/0x13d0 net/core/sock.c:880 [< inline >] SYSC_setsockopt net/socket.c:1746 [<ffffffff84aed7ee>] SyS_setsockopt+0x1fe/0x240 net/socket.c:1729 [<ffffffff85c18c76>] entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185 This is because we mistake a raw socket as a tcp socket. We should check both sk->sk_type and sk->sk_protocol to ensure it is a tcp socket. Willem points out __skb_complete_tx_timestamp() needs to fix as well. Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17drivers: net: xgene: fix Tx flow controlIyappan Subramanian2-18/+24
Currently the Tx flow control is based on reading the hardware state, which is not accurate since it may not reflect the descriptors that are not yet reached the memory. To accurately control the Tx flow, changing it to be software based. Signed-off-by: Iyappan Subramanian <isubramanian@apm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17tcp: restore fastopen with no data in SYN packetEric Dumazet1-11/+12
Yuchung tracked a regression caused by commit 57be5bdad759 ("ip: convert tcp_sendmsg() to iov_iter primitives") for TCP Fast Open. Some Fast Open users do not actually add any data in the SYN packet. Fixes: 57be5bdad759 ("ip: convert tcp_sendmsg() to iov_iter primitives") Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17af_unix: Revert 'lock_interruptible' in stream receive codeRainer Weikusat1-10/+3
With b3ca9b02b00704053a38bfe4c31dbbb9c13595d0, the AF_UNIX SOCK_STREAM receive code was changed from using mutex_lock(&u->readlock) to mutex_lock_interruptible(&u->readlock) to prevent signals from being delayed for an indefinite time if a thread sleeping on the mutex happened to be selected for handling the signal. But this was never a problem with the stream receive code (as opposed to its datagram counterpart) as that never went to sleep waiting for new messages with the mutex held and thus, wouldn't cause secondary readers to block on the mutex waiting for the sleeping primary reader. As the interruptible locking makes the code more complicated in exchange for no benefit, change it back to using mutex_lock. Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17Btrfs: add free space tree mount optionOmar Sandoval3-10/+88
Now we can finally hook up everything so we can actually use free space tree. The free space tree is enabled by passing the space_cache=v2 mount option. On the first mount with the this option set, the free space tree will be created and the FREE_SPACE_TREE read-only compat bit will be set. Any time the filesystem is mounted from then on, we must use the free space tree. The clear_cache option will also clear the free space tree. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: wire up the free space tree to the extent treeOmar Sandoval1-3/+33
The free space tree is updated in tandem with the extent tree. There are only a handful of places where we need to hook in: 1. Block group creation 2. Block group deletion 3. Delayed refs (extent creation and deletion) 4. Block group caching Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: add free space tree sanity testsOmar Sandoval7-48/+646
This tests the operations on the free space tree trying to excercise all of the main cases for both formats. Between this and xfstests, the free space tree should have pretty good coverage. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: implement the free space B-treeOmar Sandoval5-4/+1686
The free space cache has turned out to be a scalability bottleneck on large, busy filesystems. When the cache for a lot of block groups needs to be written out, we can get extremely long commit times; if this happens in the critical section, things are especially bad because we block new transactions from happening. The main problem with the free space cache is that it has to be written out in its entirety and is managed in an ad hoc fashion. Using a B-tree to store free space fixes this: updates can be done as needed and we get all of the benefits of using a B-tree: checksumming, RAID handling, well-understood behavior. With the free space tree, we get commit times that are about the same as the no cache case with load times slower than the free space cache case but still much faster than the no cache case. Free space is represented with extents until it becomes more space-efficient to use bitmaps, giving us similar space overhead to the free space cache. The operations on the free space tree are: adding and removing free space, handling the creation and deletion of block groups, and loading the free space for a block group. We can also create the free space tree by walking the extent tree and clear the free space tree. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: introduce the free space B-tree on-disk formatOmar Sandoval2-1/+40
The on-disk format for the free space tree is straightforward. Each block group is represented in the free space tree by a free space info item that stores accounting information: whether the free space for this block group is stored as bitmaps or extents and how many extents of free space exist for this block group (regardless of which format is being used in the tree). Extents are (start, FREE_SPACE_EXTENT, length) keys with no corresponding item, and bitmaps instead have the FREE_SPACE_BITMAP type and have a bitmap item attached, which is just an array of bytes. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: refactor caching_thread()Omar Sandoval2-26/+36
We're also going to load the free space tree from caching_thread(), so we should refactor some of the common code. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: add helpers for read-only compat bitsOmar Sandoval1-0/+82
We're finally going to add one of these for the free space tree, so let's add the same nice helpers that we have for the incompat bits. While we're add it, also add helpers to clear the bits. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: add extent buffer bitmap sanity testsOmar Sandoval3-16/+160
Sanity test the extent buffer bitmap operations (test, set, and clear) against the equivalent standard kernel operations. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: add extent buffer bitmap operationsOmar Sandoval2-0/+155
These are going to be used for the free space tree bitmap items. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds5-11/+8
Pull drm fixes from Dave Airlie: "Some i915 fixes, one omap fix, one core regression fix. Not even enough fixes for a twelve days of xmas song, which seemms good" * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: drm: Don't overwrite UNVERFIED mode status to OK drm/omap: fix fbdev pix format to support all platforms drm/i915: Do a better job at disabling primary plane in the noatomic case. drm/i915/skl: Double RC6 WRL always on drm/i915/skl: Disable coarse power gating up until F0 drm/i915: Remove incorrect warning in context cleanup
2015-12-17locking/osq: Fix ordering of node initialisation in osq_lockWill Deacon1-3/+5
The Cavium guys reported a soft lockup on their arm64 machine, caused by commit c55a6ffa6285 ("locking/osq: Relax atomic semantics"): mutex_optimistic_spin+0x9c/0x1d0 __mutex_lock_slowpath+0x44/0x158 mutex_lock+0x54/0x58 kernfs_iop_permission+0x38/0x70 __inode_permission+0x88/0xd8 inode_permission+0x30/0x6c link_path_walk+0x68/0x4d4 path_openat+0xb4/0x2bc do_filp_open+0x74/0xd0 do_sys_open+0x14c/0x228 SyS_openat+0x3c/0x48 el0_svc_naked+0x24/0x28 This is because in osq_lock we initialise the node for the current CPU: node->locked = 0; node->next = NULL; node->cpu = curr; and then publish the current CPU in the lock tail: old = atomic_xchg_acquire(&lock->tail, curr); Once the update to lock->tail is visible to another CPU, the node is then live and can be both read and updated by concurrent lockers. Unfortunately, the ACQUIRE semantics of the xchg operation mean that there is no guarantee the contents of the node will be visible before lock tail is updated. This can lead to lock corruption when, for example, a concurrent locker races to set the next field. Fixes: c55a6ffa6285 ("locking/osq: Relax atomic semantics"): Reported-by: David Daney <ddaney@caviumnetworks.com> Reported-by: Andrew Pinski <andrew.pinski@caviumnetworks.com> Tested-by: Andrew Pinski <andrew.pinski@caviumnetworks.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1449856001-21177-1-git-send-email-will.deacon@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-17Merge branch 'libnvdimm-fixes' of ↵Linus Torvalds7-10/+11
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm fixes from Dan Williams: - Two bug fixes for misuse of PAGE_MASK in scatterlist and dma-debug. These are tagged for -stable. The scatterlist impact is potentially corrupted dma addresses on HIGHMEM enabled platforms. - A minor locking fix for the NFIT hot-add implementation that is new in 4.4-rc. This would only trigger in the case a hot-add raced driver removal. * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dma-debug: Fix dma_debug_entry offset calculation Revert "scatterlist: use sg_phys()" nfit: acpi_nfit_notify(): Do not leave device locked
2015-12-17Merge remote-tracking branch 'mkp-scsi/4.4/scsi-fixes' into fixesJames Bottomley1-10/+10
2015-12-17gpio: revert get() to non-errorprogating behaviourLinus Walleij1-1/+7
commit e20538b82f1f ("gpio: Propagate errors from chip->get()") started to propagate errors from the .get() functions since we can get errors from the infrastructure of e.g. slowbus GPIO expanders. However it turns out a bunch of drivers relied on the core to clamp the value, so we need to revert to the old behaviour and go over all drivers and fix them to conform to the expectations of the core before we go back to propagating the error code. Cc: stable@vger.kernel.org # 4.3+ Cc: Bjorn Andersson <bjorn.andersson@sonymobile.com> Cc: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com> Fixes: e20538b82f1f ("gpio: Propagate errors from chip->get()") Reported-by: Michael Trimarchi <michael@amarulasolutions.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2015-12-17gpio: generic: clamp values from bgpio_get_set()Linus Walleij1-2/+2
The bgpio_get_set() call should return a value clamped to [0,1], the current code will return a negative value if reading bit 31, which turns the value negative as this is a signed value and thus gets interpreted as an error by the gpiolib core. Found on the gpio-mxc but applies to any MMIO driver. Cc: stable@vger.kernel.org # 4.3+ Cc: kernel@pengutronix.de Cc: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com> Fixes: e20538b82f1f ("gpio: Propagate errors from chip->get()") Reported-by: Clemens Gruber <clemens.gruber@pqgruber.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2015-12-17Btrfs: fix leaking of ordered extents after direct IO write errorFilipe Manana1-5/+38
When doing a direct IO write, __blockdev_direct_IO() can call the btrfs_get_blocks_direct() callback one or more times before it calls the btrfs_submit_direct() callback. However it can fail after calling the first callback and before calling the second callback, which is a problem because the first one creates ordered extents and the second one is the one that submits bios that cover the ordered extents created by the first one. That means the ordered extents will never complete nor have any of the flags BTRFS_ORDERED_IO_DONE / BTRFS_ORDERED_IOERR set, resulting in subsequent operations (such as other direct IO writes, buffered writes or hole punching) that lock the same IO range and lookup for ordered extents in the range to hang forever waiting for those ordered extents because they can not complete ever, since no bio was submitted. Fix this by tracking a range of created ordered extents that don't have yet corresponding bios submitted and completing the ordered extents in the range if __blockdev_direct_IO() fails with an error. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-17Btrfs: fix deadlock between direct IO write and defrag/readpagesFilipe Manana1-17/+13
If readpages() (triggered by defrag or buffered reads) is called while a direct IO write is in progress, we have a small time window where we can deadlock, resulting in traces like the following being generated: [84723.212993] INFO: task fio:2849 blocked for more than 120 seconds. [84723.214310] Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [84723.215640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [84723.217313] fio D ffff88023ec75218 0 2849 2835 0x00000000 [84723.218778] ffff880122dfb6e8 0000000000000092 0000000000000000 ffff88023ec75200 [84723.220458] ffff88000e05d2c0 ffff880122dfc000 ffff88023ec75200 7fffffffffffffff [84723.230597] 0000000000000002 ffffffff8147891a ffff880122dfb700 ffffffff8147856a [84723.232085] Call Trace: [84723.232625] [<ffffffff8147891a>] ? bit_wait+0x3c/0x3c [84723.233529] [<ffffffff8147856a>] schedule+0x7d/0x95 [84723.234398] [<ffffffff8147baa3>] schedule_timeout+0x43/0x10b [84723.235384] [<ffffffff810f82eb>] ? time_hardirqs_on+0x15/0x28 [84723.236426] [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf [84723.237502] [<ffffffff810af8a3>] ? read_seqcount_begin.constprop.20+0x57/0x6d [84723.238807] [<ffffffff8108a09b>] ? trace_hardirqs_on_caller+0x16/0x1ab [84723.242012] [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf [84723.243064] [<ffffffff810af2ad>] ? timekeeping_get_ns+0xe/0x33 [84723.244116] [<ffffffff810afa2e>] ? ktime_get+0x41/0x52 [84723.245029] [<ffffffff81477cff>] io_schedule_timeout+0xb7/0x12b [84723.245942] [<ffffffff81477cff>] ? io_schedule_timeout+0xb7/0x12b [84723.246596] [<ffffffff81478953>] bit_wait_io+0x39/0x45 [84723.247503] [<ffffffff81478b93>] __wait_on_bit_lock+0x49/0x8d [84723.248540] [<ffffffff8111684f>] __lock_page+0x66/0x68 [84723.249558] [<ffffffff81081c9b>] ? autoremove_wake_function+0x3a/0x3a [84723.250844] [<ffffffff81124a04>] lock_page+0x2c/0x2f [84723.251871] [<ffffffff81124afc>] invalidate_inode_pages2_range+0xf5/0x2aa [84723.253274] [<ffffffff81117c34>] ? filemap_fdatawait_range+0x12d/0x146 [84723.254757] [<ffffffff81118191>] ? filemap_fdatawrite_range+0x13/0x15 [84723.256378] [<ffffffffa05139a2>] btrfs_get_blocks_direct+0x1b0/0x664 [btrfs] [84723.258556] [<ffffffff8119e3f9>] ? submit_page_section+0x7b/0x111 [84723.260064] [<ffffffff8119eb90>] do_blockdev_direct_IO+0x658/0xbdb [84723.261479] [<ffffffffa05137f2>] ? btrfs_page_exists_in_range+0x1a9/0x1a9 [btrfs] [84723.262961] [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs] [84723.264449] [<ffffffff8119f144>] __blockdev_direct_IO+0x31/0x33 [84723.265614] [<ffffffff8119f144>] ? __blockdev_direct_IO+0x31/0x33 [84723.266769] [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs] [84723.268264] [<ffffffffa050935d>] btrfs_direct_IO+0x1b9/0x259 [btrfs] [84723.270954] [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs] [84723.272465] [<ffffffff8111878c>] generic_file_direct_write+0xb3/0x128 [84723.273734] [<ffffffffa051955c>] btrfs_file_write_iter+0x228/0x404 [btrfs] [84723.275101] [<ffffffff8116ca6f>] __vfs_write+0x7c/0xa5 [84723.276200] [<ffffffff8116cfab>] vfs_write+0xa0/0xe4 [84723.277298] [<ffffffff8116d79d>] SyS_write+0x50/0x7e [84723.278327] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f [84723.279595] INFO: lockdep is turned off. [84723.379035] INFO: task btrfs:2923 blocked for more than 120 seconds. [84723.380323] Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [84723.381608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [84723.383003] btrfs D ffff88023ed75218 0 2923 2859 0x00000000 [84723.384277] ffff88001311f860 0000000000000082 ffff88001311f840 ffff88023ed75200 [84723.385748] ffff88012c6751c0 ffff880013120000 ffff88012042fe68 ffff88012042fe30 [84723.387152] ffff880221571c88 0000000000000001 ffff88001311f878 ffffffff8147856a [84723.388620] Call Trace: [84723.389105] [<ffffffff8147856a>] schedule+0x7d/0x95 [84723.391882] [<ffffffffa051da32>] btrfs_start_ordered_extent+0x161/0x1fa [btrfs] [84723.393718] [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31 [84723.395659] [<ffffffffa0522c5b>] __do_contiguous_readpages.constprop.21+0x81/0xdc [btrfs] [84723.397383] [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs] [84723.398852] [<ffffffffa0522da3>] __extent_readpages.constprop.20+0xed/0x100 [btrfs] [84723.400561] [<ffffffff81123f6c>] ? __lru_cache_add+0x5d/0x72 [84723.401787] [<ffffffffa0523896>] extent_readpages+0x111/0x1a7 [btrfs] [84723.403121] [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs] [84723.404583] [<ffffffffa05088fa>] btrfs_readpages+0x1f/0x21 [btrfs] [84723.406007] [<ffffffff811226df>] __do_page_cache_readahead+0x168/0x1f4 [84723.407502] [<ffffffff81122988>] ondemand_readahead+0x21d/0x22e [84723.408937] [<ffffffff81122988>] ? ondemand_readahead+0x21d/0x22e [84723.410487] [<ffffffff81122af1>] page_cache_sync_readahead+0x3d/0x3f [84723.411710] [<ffffffffa0535388>] btrfs_defrag_file+0x419/0xaaf [btrfs] [84723.413007] [<ffffffffa0531db0>] ? kzalloc+0xf/0x11 [btrfs] [84723.414085] [<ffffffffa0535b43>] btrfs_ioctl_defrag+0x125/0x14e [btrfs] [84723.415307] [<ffffffffa0536753>] btrfs_ioctl+0x746/0x24c6 [btrfs] [84723.416532] [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc [84723.417731] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [84723.418699] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [84723.421532] [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7 [84723.422629] [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174 [84723.423712] [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6 [84723.424801] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e [84723.425968] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71 [84723.427063] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79 [84723.428138] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f Consider the following logical and physical file layout: logical: ... [ prealloc extent A ] [ prealloc extent B ] [ extent C ] ... 4K 8K 16K physical: ... 12853248 12857344 1103101952 ... (= 12853248 + 4K) Extents A and B are physically adjacent. The following diagram shows a sequence of events that lead to the deadlock when we attempt to do a direct IO write against the file range [4K, 16K[ and a defrag is triggered simultaneously. CPU 1 CPU 2 btrfs_direct_IO() btrfs_get_blocks_direct() creates ordered extent A, covering the 4k prealloc extent A (range [4K, 8K[) btrfs_defrag_file() page_cache_sync_readahead([0K, 1M[) btrfs_readpages() extent_readpages() locks all pages in the file range [0K, 128K[ through calls to add_to_page_cache_lru() __do_contiguous_readpages() finds ordered extent A waits for it to complete btrfs_get_blocks_direct() called again lock_extent_direct(range [8K, 16K[) finds a page in range [8K, 16K[ through btrfs_page_exists_in_range() invalidate_inode_pages2_range([8K, 16K[) --> tries to lock pages that are already locked by the task at CPU 2 --> our task, running __blockdev_direct_IO(), hangs waiting to lock the pages and the submit bio callback, btrfs_submit_direct(), ends up never being called, resulting in the ordered extent A never completing (because a corresponding bio is never submitted) and CPU 2 will wait for it forever while holding the pages locked ---> deadlock! Fix this by removing the page invalidation approach when attempting to lock the range for IO from the callback btrfs_get_blocks_direct() and falling back buffered IO. This was a rare case anyway and well behaved applications do not mix concurrent direct IO writes with buffered reads anyway, being a concurrent defrag the only normal case that could lead to the deadlock. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-17Btrfs: fix error path when failing to submit bio for direct IO writeFilipe Manana1-27/+27
Commit 61de718fceb6 ("Btrfs: fix memory corruption on failure to submit bio for direct IO") fixed problems with the error handling code after we fail to submit a bio for direct IO. However there were 2 problems that it did not address when the failure is due to memory allocation failures for direct IO writes: 1) We considered that there could be only one ordered extent for the whole IO range, which is not always true, as we can have multiple; 2) It did not set the bit BTRFS_ORDERED_IO_DONE in the ordered extent, which can make other tasks running btrfs_wait_logged_extents() hang forever, since they wait for that bit to be set. The general assumption is that regardless of an error, the BTRFS_ORDERED_IO_DONE is always set and it precedes setting the bit BTRFS_ORDERED_COMPLETE. Fix these issues by moving part of the btrfs_endio_direct_write() handler into a new helper function and having that new helper function called when we fail to allocate memory to submit the bio (and its private object) for a direct IO write. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-12-17Btrfs: fix memory leaks after transaction is abortedFilipe Manana1-0/+17
When a transaction is aborted, or its commit fails before writing the new superblock and calling btrfs_finish_extent_commit(), we leak reference counts on the block groups attached to the transaction's delete_bgs list, because btrfs_finish_extent_commit() is never called for those two cases. Fix this by dropping their references at btrfs_put_transaction(), which is called when transactions are aborted (by making the transaction kthread commit the transaction) or if their commits fail. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-17Btrfs: fix race when finishing dev replace leading to transaction abortFilipe Manana1-2/+15
During the final phase of a device replace operation, I ran into a transaction abort that resulted in the following trace: [23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]() [23919.664742] BTRFS: Transaction aborted (error -2) [23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs] [23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1 [23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [23919.689151] 0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98 [23919.692604] ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48 [23919.694230] ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0 [23919.696716] Call Trace: [23919.698669] [<ffffffff812566f4>] dump_stack+0x4e/0x79 [23919.700597] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8 [23919.701958] [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs] [23919.703612] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50 [23919.705047] [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs] [23919.706967] [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs] [23919.708611] [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs] [23919.710099] [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs] [23919.711970] [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs] [23919.713602] [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194 [23919.714756] [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78 [23919.716155] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0 [23919.718918] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0 [23919.724170] [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff [23919.725482] [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b [23919.726790] [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6 [23919.728428] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e [23919.729642] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71 [23919.730782] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79 [23919.731847] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f [23919.733330] ---[ end trace 166ef301a335832a ]--- This is due to a race between device replace and chunk allocation, which the following diagram illustrates: CPU 1 CPU 2 btrfs_dev_replace_finishing() at this point dev_replace->tgtdev->devid == BTRFS_DEV_REPLACE_DEVID (0ULL) ... btrfs_start_transaction() btrfs_commit_transaction() btrfs_fallocate() btrfs_alloc_data_chunk_ondemand() btrfs_join_transaction() --> starts a new transaction do_chunk_alloc() lock fs_info->chunk_mutex btrfs_alloc_chunk() --> creates extent map for the new chunk with em->bdev->map->stripes[i]->dev->devid == X (X > 0) --> extent map is added to fs_info->mapping_tree --> initial phase of bg A allocation completes unlock fs_info->chunk_mutex lock fs_info->chunk_mutex btrfs_dev_replace_update_device_in_mapping_tree() --> iterates fs_info->mapping_tree and replaces the device in every extent map's map->stripes[] with dev_replace->tgtdev, which still has an id of 0ULL (BTRFS_DEV_REPLACE_DEVID) btrfs_end_transaction() btrfs_create_pending_block_groups() --> starts final phase of bg A creation (update device, extent, and chunk trees, etc) btrfs_finish_chunk_alloc() btrfs_update_device() --> attempts to update a device item with ID == 0ULL (BTRFS_DEV_REPLACE_DEVID) which is the current ID of bg A's em->bdev->map->stripes[i]->dev->devid --> doesn't find such item returns -ENOENT --> the device id should have been X and not 0ULL got -ENOENT from btrfs_finish_chunk_alloc() and aborts current transaction finishes setting up the target device, namely it sets tgtdev->devid to the value of srcdev->devid, which is X (and X > 0) frees the srcdev unlock fs_info->chunk_mutex So fix this by taking the device list mutex when processing the chunk's extent map stripes to update the device items. This avoids getting the wrong device id and use-after-free problems if the task finishing a chunk allocation grabs the replaced device, which is freed while the dev replace task is holding the device list mutex. This happened while running fstest btrfs/071. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-12-17powerpc/powernv: pr_warn_once on unsupported OPAL_MSG typeStewart Smith1-1/+1
When running on newer OPAL firmware that supports sending extra OPAL_MSG types, we would print a warning on *every* message received. This could be a problem for kernels that don't support OPAL_MSG_OCC on machines that are running real close to thermal limits and the OCC is throttling the chip. For a kernel that is paying attention to the message queue, we could get these notifications quite often. Conceivably, future message types could also come fairly often, and printing that we didn't understand them 10,000 times provides no further information than printing them once. Cc: stable@vger.kernel.org Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-17ARC: smp: Rename platform hook @init_cpu_smp -> @init_per_cpuVineet Gupta3-6/+6
Makes it similar to smp_ops which also has callback with same name Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2015-12-17ARC: rename smp operation init_irq_cpu() to init_per_cpu()Noam Camus4-7/+7
This will better reflect its description i.e. "any needed setup..." and not just do an "IPI request". Signed-off-by: Noam Camus <noamc@ezchip.com> Acked-by: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2015-12-17ARC: dw2 unwind: Ignore CIE version !=1 gracefully instead of bailingVineet Gupta1-4/+9
ARC dwarf unwinder only supports CIE version == 1 The boot time dwarf sanitizer (part of binary lookup table constructor) would simply bail if it saw CIE version == 3, rendering unwinder with a NULL lookup table. It seems libgcc linked with kernel does have such entries. With fallback linear search removed, and a NULL binary lookup table, unwinder fails to generate any stack trace. So allow graceful ignoring of unsupported CIE entries. This problem was initially seen in Alexey's setup (and not mine) as he was using buildroot built toolchain (libgcc) which doesn't get built with CFLAGS_FOR_TARGET="-gdwarf-2 which is my default Fixes STAR 9000985048: "kernel unwinder broken with stock tools" Fixes: 2e22502c080f ARC: dw2 unwind: Remove falllback linear search thru FDE entries Reported-by Alexey Brodkin <abrodkin@synopsys.com> Cc: <stable@vger.kernel.org> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2015-12-17ARC: dw2 unwind: Reinstante unwinding out of modulesVineet Gupta3-19/+26
The fix which removed linear searching of dwarf (because binary lookup data always exists) missed out on the fact that modules don't get the binary lookup tables info. This caused unwinding out of modules to stop working. So add binary lookup header setup (equivalent of eh_frame_hdr setup) to modules as well. While at it, confine the header setup to within unwinder code, reducing one API exposed out of unwinder code. Fixes: 2e22502c080f ARC: dw2 unwind: Remove falllback linear search thru FDE entries Cc: <stable@vger.kernel.org> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2015-12-17ARC: [plat-sim] unbork non default CONFIG_LINUX_LINK_BASEVineet Gupta3-2/+6
HIGHMEM support bumped the default memory size for nsim platform to 1G. Thus total memory ended at the very edge of start of peripherals address space. With linux link base shifted, memory started bleeding into peripheral space which caused early boot bad_page spew ! Fixes: 29e332261d2 ("ARC: mm: HIGHMEM: populate high memory from DT") Reported-by: Anton Kolesov <akolesov@synopsys.com> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2015-12-16fou: clean up socket with kfree_rcuHannes Frederic Sowa1-1/+2
fou->udp_offloads is managed by RCU. As it is actually included inside the fou sockets, we cannot let the memory go out of scope before a grace period. We either can synchronize_rcu or switch over to kfree_rcu to manage the sockets. kfree_rcu seems appropriate as it is used by vxlan and geneve. Fixes: 23461551c00628c ("fou: Support for foo-over-udp RX path") Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16Merge tag 'mac80211-for-davem-2015-12-15' of ↵David S. Miller9-74/+92
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Another set of fixes: * memory leak fixes (from Ola) * operating mode notification spec compliance fix (from Eyal) * copy rfkill names in case pointer becomes invalid (myself) * two hardware restart fixes (myself) * get rid of "limiting TX power" log spam (myself) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-1682xx: FCC: Fixing a bug causing to FCC port lock-upMartin Roth1-1/+1
The patch fixes FCC port lock-up, which occurs as a result of a bug during underrun/collision handling. Within the tx_startup() function in mac-fcc.c, the address of last BD is not calculated correctly. As a result of wrong calculation of the last BD address, the next transmitted BD may be set to an area out of the transmit BD ring. This actually causes to port lock-up and it is not recoverable. Signed-off-by: Martin Roth <martin.roth@motorolasolutions.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16gianfar: Don't enable RX Filer if not supportedHamish Martin2-3/+6
After commit 15bf176db1fb ("gianfar: Don't enable the Filer w/o the Parser"), 'TSEC' model controllers (for example as seen on MPC8541E) always have 8 bytes stripped from the front of received frames. Only 'eTSEC' gianfar controllers have the RX Filer capability (amongst other enhancements). Previously this was treated as always enabled for both 'TSEC' and 'eTSEC' controllers. In commit 15bf176db1fb ("gianfar: Don't enable the Filer w/o the Parser") a subtle change was made to the setting of 'uses_rxfcb' to effectively always set it (since 'rx_filer_enable' was always true). This had the side-effect of always stripping 8 bytes from the front of received frames on 'TSEC' type controllers. We now only enable the RX Filer capability on controller types that support it, thereby avoiding the issue for 'TSEC' type controllers. Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reviewed-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz> Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz> Reviewed-by: Claudiu Manoil <claudiu.manoil@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16dma-debug: Fix dma_debug_entry offset calculationDaniel Mentz1-2/+2
dma-debug uses struct dma_debug_entry to keep track of dma coherent memory allocation requests. The virtual address is converted into a pfn and an offset. Previously, the offset was calculated using an incorrect bit mask. As a result, we saw incorrect error messages from dma-debug like the following: "DMA-API: exceeded 7 overlapping mappings of cacheline 0x03e00000" Cacheline 0x03e00000 does not exist on our platform. Cc: <stable@vger.kernel.org> Fixes: 0abdd7a81b7e ("dma-debug: introduce debug_dma_assert_idle()") Signed-off-by: Daniel Mentz <danielmentz@google.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-16Merge branch 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-armLinus Torvalds7-68/+138
Pull ARM fixes from Russell King: "Further ARM fixes: - Anson Huang noticed that we were corrupting a register we shouldn't be during suspend on some CPUs. - Shengjiu Wang spotted a bug in the 'swp' instruction emulation. - Will Deacon fixed a bug in the ASID allocator. - Laura Abbott fixed the kernel permission protection to apply to all threads running in the system. - I've fixed two bugs with the domain access control register handling, one to do with printing an appropriate value at oops time, and the other to further fix the uaccess_with_memcpy code" * 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: ARM: 8475/1: SWP emulation: Restore original *data when failed ARM: 8471/1: need to save/restore arm register(r11) when it is corrupted ARM: fix uaccess_with_memcpy() with SW_DOMAIN_PAN ARM: report proper DACR value in oops dumps ARM: 8464/1: Update all mm structures with section adjustments ARM: 8465/1: mm: keep reserved ASIDs in sync with mm after multiple rollovers
2015-12-16net: fix warnings in 'make htmldocs' by moving macro definition out of field ↵Hannes Frederic Sowa1-1/+1
declaration Docbook does not like the definition of macros inside a field declaration and adds a warning. Move the definition out. Fixes: 79462ad02e86180 ("net: add validation for the socket syscall protocol argument") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16rhashtable: Fix walker list corruptionHerbert Xu1-9/+7
The commit ba7c95ea3870fe7b847466d39a049ab6f156aa2c ("rhashtable: Fix sleeping inside RCU critical section in walk_stop") introduced a new spinlock for the walker list. However, it did not convert all existing users of the list over to the new spin lock. Some continued to use the old mutext for this purpose. This obviously led to corruption of the list. The fix is to use the spin lock everywhere where we touch the list. This also allows us to do rcu_rad_lock before we take the lock in rhashtable_walk_start. With the old mutex this would've deadlocked but it's safe with the new spin lock. Fixes: ba7c95ea3870 ("rhashtable: Fix sleeping inside RCU...") Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16rhashtable: Enforce minimum size on initial hash tableHerbert Xu1-3/+3
William Hua <william.hua@canonical.com> wrote: > > I wasn't aware there was an enforced minimum size. I simply set the > nelem_hint in the rhastable_params struct to 1, expecting it to grow as > needed. This caused a segfault afterwards when trying to insert an > element. OK we're doing the size computation before we enforce the limit on min_size. ---8<--- We need to do the initial hash table size computation after we have obtained the correct min_size/max_size parameters. Otherwise we may end up with a hash table whose size is outside the allowed envelope. Fixes: a998f712f77e ("rhashtable: Round up/down min/max_size to...") Reported-by: William Hua <william.hua@canonical.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16Merge remote-tracking branches 'spi/fix/dspi' and 'spi/fix/spidev' into ↵Mark Brown2-7/+7
spi-linus
2015-12-16Merge remote-tracking branch 'spi/fix/core' into spi-linusMark Brown1-1/+1
2015-12-16spi: fix parent-device reference leakJohan Hovold1-1/+1
Fix parent-device reference leak due to SPI-core taking an unnecessary reference to the parent when allocating the master structure, a reference that was never released. Note that driver core takes its own reference to the parent when the master device is registered. Fixes: 49dce689ad4e ("spi doesn't need class_device") Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Mark Brown <broonie@kernel.org> Cc: stable@vger.kernel.org
2015-12-16spi: spidev: Hold spi_lock over all defererences of spi in release()Mark Brown1-1/+1
We use the spi_lock spinlock to protect against races between the device being removed and file operations on the spidev. This means that in the removal path all references to the device need to be done under lock as in removal we dropping references to the device. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2015-12-16Partial revert of "powerpc: Individual System V IPC system calls"Michael Ellerman2-24/+12
This partially reverts commit a34236155afb1cc41945e58388ac988431bcb0b8. While reviewing the glibc patch to exploit the individual IPC calls, Arnd & Andreas noticed that we were still requiring userspace to pass IPC_64 in order to get the new style IPC API. With a bit of cleanup in the kernel we can drop that requirement, and instead only provide the new style API, which will simplify things for userspace. Rather than try and sneak that patch into 4.4, instead we will drop the individual IPC calls for powerpc, and merge them again in 4.5 once the cleanup patch has gone in. Because we've already added sys_mlock2() as syscall #378, we don't do a full revert of the IPC calls. Instead we drop the __NR #defines, and send those now undefined syscall numbers to sys_ni_syscall(). This leaves a gap in the syscall numbers, but we'll reuse them when we merge the individual IPC calls. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Arnd Bergmann <arnd@arndb.de>
2015-12-16inet: tcp: fix inetpeer_set_addr_v4()Eric Dumazet1-0/+1
David Ahern added a vif field in the a4 part of inetpeer_addr struct. This broke IPv4 TCP fast open client side and more generally tcp metrics cache, because inetpeer_addr_cmp() is now comparing two u32 instead of one. inetpeer_set_addr_v4() needs to properly init vif field, otherwise the comparison result depends on uninitialized data. Fixes: 192132b9a034 ("net: Add support for VRFs to inetpeer cache") Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15ipv6: automatically enable stable privacy mode if stable_secret setHannes Frederic Sowa1-0/+6
Bjørn reported that while we switch all interfaces to privacy stable mode when setting the secret, we don't set this mode for new interfaces. This does not make sense, so change this behaviour. Fixes: 622c81d57b392cc ("ipv6: generation of stable privacy addresses for link-local and autoconf") Reported-by: Bjørn Mork <bjorn@mork.no> Cc: Bjørn Mork <bjorn@mork.no> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15Revert "scatterlist: use sg_phys()"Dan Williams5-7/+8
commit db0fa0cb0157 "scatterlist: use sg_phys()" did replacements of the form: phys_addr_t phys = page_to_phys(sg_page(s)); phys_addr_t phys = sg_phys(s) & PAGE_MASK; However, this breaks platforms where sizeof(phys_addr_t) > sizeof(unsigned long). Revert for 4.3 and 4.4 to make room for a combined helper in 4.5. Cc: <stable@vger.kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: db0fa0cb0157 ("scatterlist: use sg_phys()") Suggested-by: Joerg Roedel <joro@8bytes.org> Reported-by: Vitaly Lavrov <vel21ripn@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-15net: fix uninitialized variable issuetadeusz.struk@intel.com1-0/+1
msg_iocb needs to be initialized on the recv/recvfrom path. Otherwise afalg will wrongly interpret it as an async call. Cc: stable@vger.kernel.org Reported-by: Harald Freudenberger <freude@linux.vnet.ibm.com> Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15bluetooth: Validate socket address length in sco_sock_bind().David S. Miller1-0/+3
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15Input: elan_i2c - set input device's vendor and product IDsCharlie Mooney1-0/+3
Previously the "vendor" and "product" IDs for the elan_i2c driver simply reported 0000. This patch modifies the elan_i2c driver to include the Elan vendor ID and the touchpad's product id under input/input*/{vendor,product}. Specifically, this is to allow us to apply a generic Elan gestures config that will apply to all Elan touchpads on ChromeOS. These configs match to input devices in various ways, but one major way is by matching on vendor ID. Adding this patch allows the default Elan touchpad config to be applied to Elan touchpads in this kernel by matching on devices that have vendor ID 04f3. Note that product ID is also available via custom sysfs entry "product_id" as well. Signed-off-by: Charlie Mooney <charliemooney@chromium.org> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2015-12-15Merge tag 'dmaengine-fix-4.4-rc6' of ↵Linus Torvalds6-57/+110
git://git.infradead.org/users/vkoul/slave-dma Pull dmaengine fixes from Vinod Koul: "This has fixes spread thru driver, notably among them: - edma fixes for recent edma DT changes which went into 4.4 - odd fixes for at_hdmac - minor fixes on bc dma and mic dma" * tag 'dmaengine-fix-4.4-rc6' of git://git.infradead.org/users/vkoul/slave-dma: dmaengine: at_xdmac: fix at_xdmac_prep_dma_memcpy() dmaengine: edma: DT: Change reserved slot array from 16bit to 32bit type dmaengine: edma: DT: Change memcpy channel array from 16bit to 32bit type dmaengine: mic_x100: add missing spin_unlock dmaengine: bcm2835-dma: Convert to use DMA pool dmaengine: at_xdmac: fix bad behavior in interleaved mode dmaengine: at_xdmac: fix false condition for memset_sg transfers dmaengine: at_xdmac: fix macro typo
2015-12-15Merge tag 'fbdev-fixes-4.4' of ↵Linus Torvalds2-1/+24
git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux Pull two fbdev fixes from Tomi Valkeinen: - OMAP: fix analog tv-out when using omapdrm - fsl: Fix kernel crash when diu_ops is not implemented * tag 'fbdev-fixes-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux: OMAPDSS: fix timings for VENC to match what omapdrm expects video: fbdev: fsl: Fix kernel crash when diu_ops is not implemented
2015-12-15Merge tag 'please-pull-mlock2' of ↵Linus Torvalds3-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux Pull ia64 fix from Tony Luck: "Wire up mlock2() syscall for ia64" * tag 'please-pull-mlock2' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: [IA64] Enable mlock2 syscall for ia64
2015-12-15net_sched: make qdisc_tree_decrease_qlen() work for non mqEric Dumazet1-1/+1
Stas Nichiporovich reported a regression in his HFSC qdisc setup on a non multi queue device. It turns out I mistakenly added a TCQ_F_NOPARENT flag on all qdisc allocated in qdisc_create() for non multi queue devices, which was rather buggy. I was clearly mislead by the TCQ_F_ONETXQUEUE that is also set here for no good reason, since it only matters for the root qdisc. Fixes: 4eaf3b84f288 ("net_sched: fix qdisc_tree_decrease_qlen() races") Reported-by: Stas Nichiporovich <stasn77@gmail.com> Tested-by: Stas Nichiporovich <stasn77@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15Merge branch 'ser_gigaset-platform-device-dealloc'David S. Miller1-12/+11
Paul Bolle says: ==================== ser_gigaset: fix deallocation of platform device structure Sascha Levin reported that the syzkaller fuzzer triggered a WARNING in ser_gigaset (see https://lkml.kernel.org/g/56587467.8050102@oracle.com ). It turned out that ser_gigaset has always deallocated its platform device structure incorrectly. Tilman submitted the patch that fixes that (3/4) and a related cleanup (4/4). Tilman also submitted a minor cleanup of some NULL checks (1/4) that prompted Alan to turn those checks into WARN_ONs (2/4). If no one hits these WARN_ONs in the next couple of releases these WARN_ONs should be removed. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>