aboutsummaryrefslogtreecommitdiffstats
path: root/net/sched/sch_sfq.c
AgeCommit message (Collapse)AuthorFilesLines
2024-02-02net/sched: Add module aliases for cls_,sch_,act_ modulesMichal Koutný1-0/+1
No functional change intended, aliases will be used in followup commits. Note for backporters: you may need to add aliases also for modules that are already removed in mainline kernel but still in your version. Patches were generated with the help of Coccinelle scripts like: cat >scripts/coccinelle/misc/tcf_alias.cocci <<EOD virtual patch virtual report @ haskernel @ @@ @ tcf_has_kind depends on report && haskernel @ identifier ops; constant K; @@ static struct tcf_proto_ops ops = { .kind = K, ... }; +char module_alias = K; EOD /usr/bin/spatch -D report --cocci-file scripts/coccinelle/misc/tcf_alias.cocci \ --dir . \ -I ./arch/x86/include -I ./arch/x86/include/generated -I ./include \ -I ./arch/x86/include/uapi -I ./arch/x86/include/generated/uapi \ -I ./include/uapi -I ./include/generated/uapi \ --include ./include/linux/compiler-version.h --include ./include/linux/kconfig.h \ --jobs 8 --chunksize 1 2>/dev/null | \ sed 's/char module_alias = "\([^"]*\)";/MODULE_ALIAS_NET_CLS("\1");/' And analogously for: static struct tc_action_ops ops = { .kind = K, static struct Qdisc_ops ops = { .id = K, (Someone familiar would be able to fit those into one .cocci file without sed post processing.) Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240201130943.19536-3-mkoutny@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-01net: sched: Fill in missing MODULE_DESCRIPTION for qdiscsVictor Nogueira1-0/+1
W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Fill in missing MODULE_DESCRIPTIONs for TC qdiscs. Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Link: https://lore.kernel.org/r/20231027155045.46291-4-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-07net: sched: add rcu annotations around qdisc->qdisc_sleepingEric Dumazet1-1/+4
syzbot reported a race around qdisc->qdisc_sleeping [1] It is time we add proper annotations to reads and writes to/from qdisc->qdisc_sleeping. [1] BUG: KCSAN: data-race in dev_graft_qdisc / qdisc_lookup_rcu read to 0xffff8881286fc618 of 8 bytes by task 6928 on cpu 1: qdisc_lookup_rcu+0x192/0x2c0 net/sched/sch_api.c:331 __tcf_qdisc_find+0x74/0x3c0 net/sched/cls_api.c:1174 tc_get_tfilter+0x18f/0x990 net/sched/cls_api.c:2547 rtnetlink_rcv_msg+0x7af/0x8c0 net/core/rtnetlink.c:6386 netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503 ___sys_sendmsg net/socket.c:2557 [inline] __sys_sendmsg+0x1e3/0x270 net/socket.c:2586 __do_sys_sendmsg net/socket.c:2595 [inline] __se_sys_sendmsg net/socket.c:2593 [inline] __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd write to 0xffff8881286fc618 of 8 bytes by task 6912 on cpu 0: dev_graft_qdisc+0x4f/0x80 net/sched/sch_generic.c:1115 qdisc_graft+0x7d0/0xb60 net/sched/sch_api.c:1103 tc_modify_qdisc+0x712/0xf10 net/sched/sch_api.c:1693 rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6395 netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503 ___sys_sendmsg net/socket.c:2557 [inline] __sys_sendmsg+0x1e3/0x270 net/socket.c:2586 __do_sys_sendmsg net/socket.c:2595 [inline] __se_sys_sendmsg net/socket.c:2593 [inline] __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 6912 Comm: syz-executor.5 Not tainted 6.4.0-rc3-syzkaller-00190-g0d85b27b0cc6 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/16/2023 Fixes: 3a7d0d07a386 ("net: sched: extend Qdisc with rcu") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Buslov <vladbu@nvidia.com> Acked-by: Jamal Hadi Salim<jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-22net/sched: use tc_qdisc_stats_dump() in qdiscZhengchao Shao1-6/+2
use tc_qdisc_stats_dump() in qdisc. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-29net/sched: store the last executed chain also for clsact egressDavide Caratti1-1/+1
currently, only 'ingress' and 'clsact ingress' qdiscs store the tc 'chain id' in the skb extension. However, userspace programs (like ovs) are able to setup egress rules, and datapath gets confused in case it doesn't find the 'chain id' for a packet that's "recirculated" by tc. Change tcf_classify() to have the same semantic as tcf_classify_ingress() so that a single function can be called in ingress / egress, using the tc ingress / egress block respectively. Suggested-by: Alaa Hleilel <alaa@nvidia.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-10net: sched: validate stab valuesEric Dumazet1-1/+1
iproute2 package is well behaved, but malicious user space can provide illegal shift values and trigger UBSAN reports. Add stab parameter to red_check_params() to validate user input. syzbot reported: UBSAN: shift-out-of-bounds in ./include/net/red.h:312:18 shift exponent 111 is too large for 64-bit type 'long unsigned int' CPU: 1 PID: 14662 Comm: syz-executor.3 Not tainted 5.12.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327 red_calc_qavg_from_idle_time include/net/red.h:312 [inline] red_calc_qavg include/net/red.h:353 [inline] choke_enqueue.cold+0x18/0x3dd net/sched/sch_choke.c:221 __dev_xmit_skb net/core/dev.c:3837 [inline] __dev_queue_xmit+0x1943/0x2e00 net/core/dev.c:4150 neigh_hh_output include/net/neighbour.h:499 [inline] neigh_output include/net/neighbour.h:508 [inline] ip6_finish_output2+0x911/0x1700 net/ipv6/ip6_output.c:117 __ip6_finish_output net/ipv6/ip6_output.c:182 [inline] __ip6_finish_output+0x4c1/0xe10 net/ipv6/ip6_output.c:161 ip6_finish_output+0x35/0x200 net/ipv6/ip6_output.c:192 NF_HOOK_COND include/linux/netfilter.h:290 [inline] ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:215 dst_output include/net/dst.h:448 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] NF_HOOK include/linux/netfilter.h:295 [inline] ip6_xmit+0x127e/0x1eb0 net/ipv6/ip6_output.c:320 inet6_csk_xmit+0x358/0x630 net/ipv6/inet6_connection_sock.c:135 dccp_transmit_skb+0x973/0x12c0 net/dccp/output.c:138 dccp_send_reset+0x21b/0x2b0 net/dccp/output.c:535 dccp_finish_passive_close net/dccp/proto.c:123 [inline] dccp_finish_passive_close+0xed/0x140 net/dccp/proto.c:118 dccp_terminate_connection net/dccp/proto.c:958 [inline] dccp_close+0xb3c/0xe60 net/dccp/proto.c:1028 inet_release+0x12e/0x280 net/ipv4/af_inet.c:431 inet6_release+0x4c/0x70 net/ipv6/af_inet6.c:478 __sock_release+0xcd/0x280 net/socket.c:599 sock_close+0x18/0x20 net/socket.c:1258 __fput+0x288/0x920 fs/file_table.c:280 task_work_run+0xdd/0x1a0 kernel/task_work.c:140 tracehook_notify_resume include/linux/tracehook.h:189 [inline] Fixes: 8afa10cbe281 ("net_sched: red: Avoid illegal values") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-28net: sched: prevent invalid Scell_log shift countRandy Dunlap1-1/+1
Check Scell_log shift size in red_check_params() and modify all callers of red_check_params() to pass Scell_log. This prevents a shift out-of-bounds as detected by UBSAN: UBSAN: shift-out-of-bounds in ./include/net/red.h:252:22 shift exponent 72 is too large for 32-bit type 'int' Fixes: 8afa10cbe281 ("net_sched: red: Avoid illegal values") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: syzbot+97c5bd9cc81eca63d36e@syzkaller.appspotmail.com Cc: Nogah Frankel <nogahf@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: netdev@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds1-1/+1
Pull networking updates from David Miller: 1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan. 2) Support UDP segmentation in code TSO code, from Eric Dumazet. 3) Allow flashing different flash images in cxgb4 driver, from Vishal Kulkarni. 4) Add drop frames counter and flow status to tc flower offloading, from Po Liu. 5) Support n-tuple filters in cxgb4, from Vishal Kulkarni. 6) Various new indirect call avoidance, from Eric Dumazet and Brian Vazquez. 7) Fix BPF verifier failures on 32-bit pointer arithmetic, from Yonghong Song. 8) Support querying and setting hardware address of a port function via devlink, use this in mlx5, from Parav Pandit. 9) Support hw ipsec offload on bonding slaves, from Jarod Wilson. 10) Switch qca8k driver over to phylink, from Jonathan McDowell. 11) In bpftool, show list of processes holding BPF FD references to maps, programs, links, and btf objects. From Andrii Nakryiko. 12) Several conversions over to generic power management, from Vaibhav Gupta. 13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry Yakunin. 14) Various https url conversions, from Alexander A. Klimov. 15) Timestamping and PHC support for mscc PHY driver, from Antoine Tenart. 16) Support bpf iterating over tcp and udp sockets, from Yonghong Song. 17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov. 18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan. 19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several drivers. From Luc Van Oostenryck. 20) XDP support for xen-netfront, from Denis Kirjanov. 21) Support receive buffer autotuning in MPTCP, from Florian Westphal. 22) Support EF100 chip in sfc driver, from Edward Cree. 23) Add XDP support to mvpp2 driver, from Matteo Croce. 24) Support MPTCP in sock_diag, from Paolo Abeni. 25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic infrastructure, from Jakub Kicinski. 26) Several pci_ --> dma_ API conversions, from Christophe JAILLET. 27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel. 28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki. 29) Refactor a lot of networking socket option handling code in order to avoid set_fs() calls, from Christoph Hellwig. 30) Add rfc4884 support to icmp code, from Willem de Bruijn. 31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei. 32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin. 33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin. 34) Support TCP syncookies in MPTCP, from Flowian Westphal. 35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano Brivio. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits) net: thunderx: initialize VF's mailbox mutex before first usage usb: hso: remove bogus check for EINPROGRESS usb: hso: no complaint about kmalloc failure hso: fix bailout in error case of probe ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM selftests/net: relax cpu affinity requirement in msg_zerocopy test mptcp: be careful on subflow creation selftests: rtnetlink: make kci_test_encap() return sub-test result selftests: rtnetlink: correct the final return value for the test net: dsa: sja1105: use detected device id instead of DT one on mismatch tipc: set ub->ifindex for local ipv6 address ipv6: add ipv6_dev_find() net: openvswitch: silence suspicious RCU usage warning Revert "vxlan: fix tos value before xmit" ptp: only allow phase values lower than 1 period farsync: switch from 'pci_' to 'dma_' API wan: wanxl: switch from 'pci_' to 'dma_' API hv_netvsc: do not use VF device if link is down dpaa2-eth: Fix passing zero to 'PTR_ERR' warning net: macb: Properly handle phylink on at91sam9x ...
2020-07-16Revert "net: sched: Pass root lock to Qdisc_ops.enqueue"Petr Machata1-1/+1
This reverts commit aebe4426ccaa4838f36ea805cdf7d76503e65117. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-07-16treewide: Remove uninitialized_var() usageKees Cook1-1/+1
Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-07net/sched: Use fallthrough pseudo-keywordGustavo A. R. Silva1-1/+1
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29net: sched: Pass root lock to Qdisc_ops.enqueuePetr Machata1-1/+1
A following patch introduces qevents, points in qdisc algorithm where packet can be processed by user-defined filters. Should this processing lead to a situation where a new packet is to be enqueued on the same port, holding the root lock would lead to deadlocks. To solve the issue, qevent handler needs to unlock and relock the root lock when necessary. To that end, add the root lock argument to the qdisc op enqueue, and propagate throughout. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27sch_sfq: validate silly quantum valuesEric Dumazet1-0/+9
syzbot managed to set up sfq so that q->scaled_quantum was zero, triggering an infinite loop in sfq_dequeue() More generally, we must only accept quantum between 1 and 2^18 - 7, meaning scaled_quantum must be in [1, 0x7FFF] range. Otherwise, we also could have a loop in sfq_dequeue() if scaled_quantum happens to be 0x8000, since slot->allot could indefinitely switch between 0 and 0x8000. Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+0251e883fe39e7a0cb0a@syzkaller.appspotmail.com Cc: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-23net/flow_dissector: switch to siphashEric Dumazet1-6/+8
UDP IPv6 packets auto flowlabels are using a 32bit secret (static u32 hashrnd in net/core/flow_dissector.c) and apply jhash() over fields known by the receivers. Attackers can easily infer the 32bit secret and use this information to identify a device and/or user, since this 32bit secret is only set at boot time. Really, using jhash() to generate cookies sent on the wire is a serious security concern. Trying to change the rol32(hash, 16) in ip6_make_flowlabel() would be a dead end. Trying to periodically change the secret (like in sch_sfq.c) could change paths taken in the network for long lived flows. Let's switch to siphash, as we did in commit df453700e8d8 ("inet: switch IP ID generator to siphash") Using a cryptographically strong pseudo random function will solve this privacy issue and more generally remove other weak points in the stack. Packet schedulers using skb_get_hash_perturb() benefit from this change. Fixes: b56774163f99 ("ipv6: Enable auto flow labels by default") Fixes: 42240901f7c4 ("ipv6: Implement different admin modes for automatic flow labels") Fixes: 67800f9b1f4e ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel") Fixes: cb1ce2ef387b ("ipv6: Implement automatic flow label generation on transmit") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Berger <jonathann1@walla.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-17net_sched: unset TCQ_F_CAN_BYPASS when adding filtersCong Wang1-2/+0
For qdisc's that support TC filters and set TCQ_F_CAN_BYPASS, notably fq_codel, it makes no sense to let packets bypass the TC filters we setup in any scenario, otherwise our packets steering policy could not be enforced. This can be reproduced easily with the following script: ip li add dev dummy0 type dummy ifconfig dummy0 up tc qd add dev dummy0 root fq_codel tc filter add dev dummy0 parent 8001: protocol arp basic action mirred egress redirect dev lo tc filter add dev dummy0 parent 8001: protocol ip basic action mirred egress redirect dev lo ping -I dummy0 192.168.112.1 Without this patch, packets are sent directly to dummy0 without hitting any of the filters. With this patch, packets are redirected to loopback as expected. This fix is not perfect, it only unsets the flag but does not set it back because we have to save the information somewhere in the qdisc if we really want that. Note, both fq_codel and sfq clear this flag in their ->bind_tcf() but this is clearly not sufficient when we don't use any class ID. Fixes: 23624935e0c4 ("net_sched: TCQ_F_CAN_BYPASS generalization") Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner1-5/+1
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-21net: sch: api: add extack support in tcf_block_getAlexander Aring1-1/+1
This patch adds extack support for the function tcf_block_get which is a common used function in the tc subsystem. Callers which are interested in the receiving error can assign extack to get a more detailed information why tcf_block_get failed. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21net: sched: sch: add extack for block callbackAlexander Aring1-1/+2
This patch adds extack support for block callback to prepare per-qdisc specific changes for extack. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21net: sched: sch: add extack for init callbackAlexander Aring1-1/+2
This patch adds extack support for init callback to prepare per-qdisc specific changes for extack. Cc: David Ahern <dsahern@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05net_sched: red: Avoid illegal valuesNogah Frankel1-0/+3
Check the qmin & qmax values doesn't overflow for the given Wlog value. Check that qmin <= qmax. Fixes: a783474591f2 ("[PKT_SCHED]: Generic RED layer") Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-28sch_sfq: fix null pointer dereference at timer expirationPaolo Abeni1-0/+1
While converting sch_sfq to use timer_setup(), the commit cdeabbb88134 ("net: sched: Convert timers to use timer_setup()") forgot to initialize the 'sch' field. As a result, the timer callback tries to dereference a NULL pointer, and the kernel does oops. Fix it initializing such field at qdisc creation time. Fixes: cdeabbb88134 ("net: sched: Convert timers to use timer_setup()") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22net: sched: mark expected switch fall-throughsGustavo A. R. Silva1-0/+1
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18net: sched: Convert timers to use timer_setup()Kees Cook1-5/+5
In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. Add pointer back to Qdisc. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16net: sched: store Qdisc pointer in struct blockJiri Pirko1-1/+1
Prepare for removal of tp->q and store Qdisc pointer in the block structure. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-3/+3
Three cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30sch_sfq: fix null pointer dereference on init failureNikolay Aleksandrov1-3/+3
Currently only a memory allocation failure can lead to this, so let's initialize the timer first. Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-28sched: sfq: drop packets after root qdisc lock is releasedGao Feng1-7/+13
The commit 520ac30f4551 ("net_sched: drop packets after root qdisc lock is released) made a big change of tc for performance. But there are some points which are not changed in SFQ enqueue operation. 1. Fail to find the SFQ hash slot; 2. When the queue is full; Now use qdisc_drop instead free skb directly. Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25net_sched: remove tc class reference countingWANG Cong1-5/+4
For TC classes, their ->get() and ->put() are always paired, and the reference counting is completely useless, because: 1) For class modification and dumping paths, we already hold RTNL lock, so all of these ->get(),->change(),->put() are atomic. 2) For filter bindiing/unbinding, we use other reference counter than this one, and they should have RTNL lock too. 3) For ->qlen_notify(), it is special because it is called on ->enqueue() path, but we already hold qdisc tree lock there, and we hold this tree lock when graft or delete the class too, so it should not be gone or changed until we release the tree lock. Therefore, this patch removes ->get() and ->put(), but: 1) Adds a new ->find() to find the pointer to a class by classid, no refcnt. 2) Move the original class destroy upon the last refcnt into ->delete(), right after releasing tree lock. This is fine because the class is already removed from hash when holding the lock. For those who also use ->put() as ->unbind(), just rename them to reflect this change. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15net_sched/sfq: update hierarchical backlog when drop packetKonstantin Khlebnikov1-1/+4
When sfq_enqueue() drops head packet or packet from another queue it have to update backlog at upper qdiscs too. Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06net: sched: introduce a TRAP control actionJiri Pirko1-0/+1
There is need to instruct the HW offloaded path to push certain matched packets to cpu/kernel for further analysis. So this patch introduces a new TRAP control action to TC. For kernel datapath, this action does not make much sense. So with the same logic as in HW, new TRAP behaves similar to STOLEN. The skb is just dropped in the datapath (and virtually ejected to an upper level, which does not exist in case of kernel). Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Yotam Gigi <yotamg@mellanox.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17net: sched: introduce tcf block infractructureJiri Pirko1-5/+10
Currently, the filter chains are direcly put into the private structures of qdiscs. In order to be able to have multiple chains per qdisc and to allow filter chains sharing among qdiscs, there is a need for common object that would hold the chains. This introduces such object and calls it "tcf_block". Helpers to get and put the blocks are provided to be called from individual qdisc code. Also, the original filter_list pointers are left in qdisc privs to allow the entry into tcf_block processing without any added overhead of possible multiple pointer dereference on fast path. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17net: sched: move tc_classify function to cls_api.cJiri Pirko1-1/+1
Move tc_classify function to cls_api.c where it belongs, rename it to fit the namespace. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-08treewide: use kv[mz]alloc* rather than opencoded variantsMichal Hocko1-5/+1
There are many code paths opencoding kvmalloc. Let's use the helper instead. The main difference to kvmalloc is that those users are usually not considering all the aspects of the memory allocator. E.g. allocation requests <= 32kB (with 4kB pages) are basically never failing and invoke OOM killer to satisfy the allocation. This sounds too disruptive for something that has a reasonable fallback - the vmalloc. On the other hand those requests might fallback to vmalloc even when the memory allocator would succeed after several more reclaim/compaction attempts previously. There is no guarantee something like that happens though. This patch converts many of those places to kv[mz]alloc* helpers because they are more conservative. Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390 Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim Acked-by: David Sterba <dsterba@suse.com> # btrfs Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4 Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5 Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Santosh Raspatur <santosh@chelsio.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Yishai Hadas <yishaih@mellanox.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-24net_sched: use setup_deferrable_timerGeliang Tang1-3/+2
Use setup_deferrable_timer() instead of init_timer_deferrable() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-11net_sched: fix error recovery at qdisc creationEric Dumazet1-1/+2
Dmitry reported uses after free in qdisc code [1] The problem here is that ops->init() can return an error. qdisc_create_dflt() then call ops->destroy(), while qdisc_create() does _not_ call it. Four qdisc chose to call their own ops->destroy(), assuming their caller would not. This patch makes sure qdisc_create() calls ops->destroy() and fixes the four qdisc to avoid double free. [1] BUG: KASAN: use-after-free in mq_destroy+0x242/0x290 net/sched/sch_mq.c:33 at addr ffff8801d415d440 Read of size 8 by task syz-executor2/5030 CPU: 0 PID: 5030 Comm: syz-executor2 Not tainted 4.3.5-smp-DEV #119 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 0000000000000046 ffff8801b435b870 ffffffff81bbbed4 ffff8801db000400 ffff8801d415d440 ffff8801d415dc40 ffff8801c4988510 ffff8801b435b898 ffffffff816682b1 ffff8801b435b928 ffff8801d415d440 ffff8801c49880c0 Call Trace: [<ffffffff81bbbed4>] __dump_stack lib/dump_stack.c:15 [inline] [<ffffffff81bbbed4>] dump_stack+0x6c/0x98 lib/dump_stack.c:51 [<ffffffff816682b1>] kasan_object_err+0x21/0x70 mm/kasan/report.c:158 [<ffffffff81668524>] print_address_description mm/kasan/report.c:196 [inline] [<ffffffff81668524>] kasan_report_error+0x1b4/0x4b0 mm/kasan/report.c:285 [<ffffffff81668953>] kasan_report mm/kasan/report.c:305 [inline] [<ffffffff81668953>] __asan_report_load8_noabort+0x43/0x50 mm/kasan/report.c:326 [<ffffffff82527b02>] mq_destroy+0x242/0x290 net/sched/sch_mq.c:33 [<ffffffff82524bdd>] qdisc_destroy+0x12d/0x290 net/sched/sch_generic.c:953 [<ffffffff82524e30>] qdisc_create_dflt+0xf0/0x120 net/sched/sch_generic.c:848 [<ffffffff8252550d>] attach_default_qdiscs net/sched/sch_generic.c:1029 [inline] [<ffffffff8252550d>] dev_activate+0x6ad/0x880 net/sched/sch_generic.c:1064 [<ffffffff824b1db1>] __dev_open+0x221/0x320 net/core/dev.c:1403 [<ffffffff824b24ce>] __dev_change_flags+0x15e/0x3e0 net/core/dev.c:6858 [<ffffffff824b27de>] dev_change_flags+0x8e/0x140 net/core/dev.c:6926 [<ffffffff824f5bf6>] dev_ifsioc+0x446/0x890 net/core/dev_ioctl.c:260 [<ffffffff824f61fa>] dev_ioctl+0x1ba/0xb80 net/core/dev_ioctl.c:546 [<ffffffff82430509>] sock_do_ioctl+0x99/0xb0 net/socket.c:879 [<ffffffff82430d30>] sock_ioctl+0x2a0/0x390 net/socket.c:958 [<ffffffff816f3b68>] vfs_ioctl fs/ioctl.c:44 [inline] [<ffffffff816f3b68>] do_vfs_ioctl+0x8a8/0xe50 fs/ioctl.c:611 [<ffffffff816f41a4>] SYSC_ioctl fs/ioctl.c:626 [inline] [<ffffffff816f41a4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:617 [<ffffffff8123e357>] entry_SYSCALL_64_fastpath+0x12/0x17 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10sched: move tcf_proto_destroy and tcf_destroy_chain helpers into cls_apiJiri Pirko1-0/+1
Creation is done in this file, move destruction to be at the same place. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25net_sched: drop packets after root qdisc lock is releasedEric Dumazet1-4/+4
Qdisc performance suffers when packets are dropped at enqueue() time because drops (kfree_skb()) are done while qdisc lock is held, delaying a dequeue() draining the queue. Nominal throughput can be reduced by 50 % when this happens, at a time we would like the dequeue() to proceed as fast as possible. Even FQ is vulnerable to this problem, while one of FQ goals was to provide some flow isolation. This patch adds a 'struct sk_buff **to_free' parameter to all qdisc->enqueue(), and in qdisc_drop() helper. I measured a performance increase of up to 12 %, but this patch is a prereq so that future batches in enqueue() can fly. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15net_sched: sch_fq: defer skb freeingEric Dumazet1-1/+1
sfq_reset() can use rtnl_kfree_skbs() instead of kfree_skb() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08sched: remove qdisc->dropFlorian Westphal1-1/+0
after removal of TCA_CBQ_OVL_STRATEGY from cbq scheduler, there are no more callers of ->drop() outside of other ->drop functions, i.e. nothing calls them. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-29net_sched: update hierarchical backlog tooWANG Cong1-7/+9
When the bottom qdisc decides to, for example, drop some packet, it calls qdisc_tree_decrease_qlen() to update the queue length for all its ancestors, we need to update the backlog too to keep the stats on root qdisc accurate. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27net: sched: consolidate tc_classify{,_compat}Daniel Borkmann1-1/+1
For classifiers getting invoked via tc_classify(), we always need an extra function call into tc_classify_compat(), as both are being exported as symbols and tc_classify() itself doesn't do much except handling of reclassifications when tp->classify() returned with TC_ACT_RECLASSIFY. CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(), all others use tc_classify(). When tc actions are being configured out in the kernel, tc_classify() effectively does nothing besides delegating. We could spare this layer and consolidate both functions. pktgen on single CPU constantly pushing skbs directly into the netif_receive_skb() path with a dummy classifier on ingress qdisc attached, improves slightly from 22.3Mpps to 23.1Mpps. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15net_sched: fix a use-after-free in sfqWANG Cong1-1/+1
Fixes: 25331d6ce42b ("net: sched: implement qstat helper routines") Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04sched: Call skb_get_hash_perturb in sch_sfqTom Herbert1-25/+2
Call skb_get_hash_perturb instead of doing skb_flow_dissect and then jhash by hand. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30net: sched: enable per cpu qstatsJohn Fastabend1-1/+1
After previous patches to simplify qstats the qstats can be made per cpu with a packed union in Qdisc struct. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30net: sched: restrict use of qstats qlenJohn Fastabend1-1/+1
This removes the use of qstats->qlen variable from the classifiers and makes it an explicit argument to gnet_stats_copy_queue(). The qlen represents the qdisc queue length and is packed into the qstats at the last moment before passnig to user space. By handling it explicitely we avoid, in the percpu stats case, having to figure out which per_cpu variable to put it in. It would probably be best to remove it from qstats completely but qstats is a user space ABI and can't be broken. A future patch could make an internal only qstats structure that would avoid having to allocate an additional u32 variable on the Qdisc struct. This would make the qstats struct 128bits instead of 128+32. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30net: sched: implement qstat helper routinesJohn Fastabend1-8/+9
This adds helpers to manipulate qstats logic and replaces locations that touch the counters directly. This simplifies future patches to push qstats onto per cpu counters. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13net: rcu-ify tcf_protoJohn Fastabend1-4/+7
rcu'ify tcf_proto this allows calling tc_classify() without holding any locks. Updaters are protected by RTNL. This patch prepares the core net_sched infrastracture for running the classifier/action chains without holding the qdisc lock however it does nothing to ensure cls_xxx and act_xxx types also work without locking. Additional patches are required to address the fall out. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09net_sched: sfq: remove unused macroFlorian Westphal1-5/+0
not used anymore since ddecf0f (net_sched: sfq: add optional RED on top of SFQ). Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-05net: use the new API kvfree()WANG Cong1-6/+1
It is available since v3.15-rc5. Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14net: replace macros net_random and net_srandom with direct calls to prandomAruna-Hewapathirane1-3/+3
This patch removes the net_random and net_srandom macros and replaces them with direct calls to the prandom ones. As new commits only seem to use prandom_u32 there is no use to keep them around. This change makes it easier to grep for users of prandom_u32. Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10net_sched: sfq: put sfq_unlink in a do - while loopYang Yingliang1-4/+6
Macros with multiple statements should be enclosed in a do - while loop Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-01pkt_sched: Stop using NLA_PUT*().David S. Miller1-1/+2
These macros contain a hidden goto, and are thus extremely error prone and make code hard to audit. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-16sch_sfq: revert dont put new flow at the end of flowsEric Dumazet1-1/+5
This reverts commit d47a0ac7b6 (sch_sfq: dont put new flow at the end of flows) As Jesper found out, patch sounded great but has bad side effects. In stress situation, pushing new flows in front of the queue can prevent old flows doing any progress. Packets can stay in SFQ queue for unlimited amount of time. It's possible to add heuristics to limit this problem, but this would add complexity outside of SFQ scope. A more sensible answer to Dave Taht concerns (who reported the issued I tried to solve in original commit) is probably to use a qdisc hierarchy so that high prio packets dont enter a potentially crowded SFQ qdisc. Reported-by: Jesper Dangaard Brouer <jdb@comx.dk> Cc: Dave Taht <dave.taht@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-09net: Make qdisc_skb_cb upper size bound explicit.David S. Miller1-3/+2
Just like skb->cb[], so that qdisc_skb_cb can be encapsulated inside of other data structures. This is intended to be used by IPoIB so that it can remember addressing information stored at hard_header_ops->create() time that it can fetch when the packet gets to the transmit routine. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12net_sched: sfq: add optional RED on top of SFQEric Dumazet1-16/+130
Adds an optional Random Early Detection on each SFQ flow queue. Traditional SFQ limits count of packets, while RED permits to also control number of bytes per flow, and adds ECN capability as well. 1) We dont handle the idle time management in this RED implementation, since each 'new flow' begins with a null qavg. We really want to address backlogged flows. 2) if headdrop is selected, we try to ecn mark first packet instead of currently enqueued packet. This gives faster feedback for tcp flows compared to traditional RED [ marking the last packet in queue ] Example of use : tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 4sec sfq \ limit 3000 headdrop flows 512 divisor 16384 \ redflowlimit 100000 min 8000 max 60000 probability 0.20 ecn qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop flows 512/16384 divisor 16384 ewma 6 min 8000b max 60000b probability 0.2 ecn prob_mark 0 prob_mark_head 4876 prob_drop 6131 forced_mark 0 forced_mark_head 0 forced_drop 0 Sent 1175211782 bytes 777537 pkt (dropped 6131, overlimits 11007 requeues 0) rate 99483Kbit 8219pps backlog 689392b 456p requeues 0 In this test, with 64 netperf TCP_STREAM sessions, 50% using ECN enabled flows, we can see number of packets CE marked is smaller than number of drops (for non ECN flows) If same test is run, without RED, we can check backlog is much bigger. qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop flows 512/16384 divisor 16384 Sent 1148683617 bytes 795006 pkt (dropped 0, overlimits 0 requeues 0) rate 98429Kbit 8521pps backlog 1221290b 841p requeues 0 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Stephen Hemminger <shemminger@vyatta.com> CC: Dave Taht <dave.taht@gmail.com> Tested-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-05net_sched: sfq: extend limitsEric Dumazet1-58/+117
SFQ as implemented in Linux is very limited, with at most 127 flows and limit of 127 packets. [ So if 127 flows are active, we have one packet per flow ] This patch brings to SFQ following features to cope with modern needs. - Ability to specify a smaller per flow limit of inflight packets. (default value being at 127 packets) - Ability to have up to 65408 active flows (instead of 127) - Ability to have head drops instead of tail drops (to drop old packets from a flow) Example of use : No more than 20 packets per flow, max 8000 flows, max 20000 packets in SFQ qdisc, hash table of 65536 slots. tc qdisc add ... sfq \ flows 8000 \ depth 20 \ headdrop \ limit 20000 \ divisor 65536 Ram usage : 2 bytes per hash table entry (instead of previous 1 byte/entry) 32 bytes per flow on 64bit arches, instead of 384 for QFQ, so much better cache hit ratio. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-04net_sched: sfq: always randomize hash perturbationEric Dumazet1-6/+6
SFQ q->perturbation is used in sfq_hash() as an input to Jenkins hash. We currently randomize this 32bit value only if a perturbation timer is setup. Its much better to always initialize it to defeat attackers, or else they can predict very well what kind of packets they have to forge to hit a particular flow. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-04net_sched: sfq: fix mem alloc error recoveryEric Dumazet1-19/+33
Since commit 817fb15dfd98 (net_sched: sfq: allow divisor to be a parameter), we can leave perturbation timer armed if a memory allocation error aborts sfq_init(). Memory containing active struct timer_list is freed and kernel can crash. Call sfq_destroy() from sfq_init() to properly dismantle qdisc. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-03sch_sfq: dont put new flow at the end of flowsEric Dumazet1-1/+1
SFQ enqueue algo puts a new flow _behind_ all pre-existing flows in the circular list. In fact this is probably an old SFQ implementation bug. 100 Mbits = ~8333 full frames per second, or ~8 frames per ms. With 50 flows, it means your "new flow" will have to wait 50 packets being sent before its own packet. Thats the ~6ms. We certainly can change SFQ to give a priority advantage to new flows, so that next dequeued packet is taken from a new flow, not an old one. Reported-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-21sch_sfq: rehash queues in perturb timerEric Dumazet1-6/+81
A known Out Of Order (OOO) problem hurts SFQ when timer changes perturbation value, since all new packets delivered to SFQ enqueue might end on different slots than previous in-flight packets. With round robin delivery, we can thus deliver packets in a different order. Since SFQ is limited to small amount of in-flight packets, we can rehash packets so that this OOO problem is fixed. This rehashing is performed only if internal flow classifier is in use. We now store in skb->cb[] the "struct flow_keys" so that we dont call skb_flow_dissect() again while rehashing. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-29sch_sfq: use skb_flow_dissect()Eric Dumazet1-55/+10
Instead of using a custom flow dissector, use skb_flow_dissect() and benefit from tunnelling support. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-01sch_sfq: fix sfq_enqueue()Eric Dumazet1-1/+6
commit 8efa88540635 (sch_sfq: avoid giving spurious NET_XMIT_CN signals) forgot to call qdisc_tree_decrease_qlen() to signal upper levels that a packet (from another flow) was dropped, leading to various problems. With help from Michal Soltys and Michal Pokrywka, who did a bisection. Bugzilla ref: https://bugzilla.kernel.org/show_bug.cgi?id=39372 Debian ref: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=631945 Reported-by: Lucas Bocchi <lucas.bocchi@gmail.com> Reported-and-bisected-by: Michal Pokrywka <wolfmoon@o2.pl> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Michal Soltys <soltys@ziu.info> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-21ip: introduce ip_is_fragment helper inline functionPaul Gortmaker1-1/+1
There are enough instances of this: iph->frag_off & htons(IP_MF | IP_OFFSET) that a helper function is probably warranted. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-25sch_sfq: fix peek() implementationEric Dumazet1-13/+1
Since commit eeaeb068f139 (sch_sfq: allow big packets and be fair), sfq_peek() can return a different skb that would be normally dequeued by sfq_dequeue() [ if current slot->allot is negative ] Use generic qdisc_peek_dequeued() instead of custom implementation, to get consistent result. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Patrick McHardy <kaber@trash.net> CC: Jesper Dangaard Brouer <hawk@diku.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-23sch_sfq: avoid giving spurious NET_XMIT_CN signalsEric Dumazet1-2/+6
While chasing a possible net_sched bug, I found that IP fragments have litle chance to pass a congestioned SFQ qdisc : - Say SFQ qdisc is full because one flow is non responsive. - ip_fragment() wants to send two fragments belonging to an idle flow. - sfq_enqueue() queues first packet, but see queue limit reached : - sfq_enqueue() drops one packet from 'big consumer', and returns NET_XMIT_CN. - ip_fragment() cancel remaining fragments. This patch restores fairness, making sure we return NET_XMIT_CN only if we dropped a packet from the same flow. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Jamal Hadi Salim <hadi@cyberus.ca> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-22inet: constify ip headers and in6_addrEric Dumazet1-1/+1
Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers where possible, to make code intention more obvious. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-02sfq: deadlock in error pathstephen hemminger1-4/+5
The change to allow divisor to be a parameter (in 2.6.38-rc1) commit 817fb15dfd988d8dda916ee04fa506f0c466b9d6 introduced a possible deadlock caught by sparse. The scheduler tree lock was left locked in the case of an incorrect divisor value. Simplest fix is to move test outside of lock which also solves problem of partial update. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24Merge branch 'master' of ↵David S. Miller1-3/+2
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/sched/sch_hfsc.c net/sched/sch_htb.c net/sched/sch_tbf.c
2011-01-21net_sched: TCQ_F_CAN_BYPASS generalizationEric Dumazet1-0/+6
Now qdisc stab is handled before TCQ_F_CAN_BYPASS test in __dev_xmit_skb(), we can generalize TCQ_F_CAN_BYPASS to other qdiscs than pfifo_fast : pfifo, bfifo, pfifo_head_drop and sfq SFQ is special because it can have external classifiers, and in these cases, we cannot bypass queue discipline (packet could be dropped by classifier) without admin asking it, or further changes. Its worth doing this, especially for SFQ, avoiding dirtying memory in case no packets are already waiting in queue. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-20net_sched: accurate bytes/packets stats/ratesEric Dumazet1-3/+2
In commit 44b8288308ac9d (net_sched: pfifo_head_drop problem), we fixed a problem with pfifo_head drops that incorrectly decreased sch->bstats.bytes and sch->bstats.packets Several qdiscs (CHOKe, SFQ, pfifo_head, ...) are able to drop a previously enqueued packet, and bstats cannot be changed, so bstats/rates are not accurate (over estimated) This patch changes the qdisc_bstats updates to be done at dequeue() time instead of enqueue() time. bstats counters no longer account for dropped frames, and rates are more correct, since enqueue() bursts dont have effect on dequeue() rate. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-20net_sched: sfq: allow divisor to be a parameterEric Dumazet1-12/+30
SFQ currently uses a 1024 slots hash table, and its internal structure (sfq_sched_data) allocation needs order-1 page on x86_64 Allow tc command to specify a divisor value (hash table size), between 1 and 65536. If no value is provided, assume the 1024 default size. This allows admins to setup smaller (or bigger) SFQ for specific needs. This also brings back sfq_sched_data allocations to order-0 ones, saving 3KB per SFQ qdisc. Jesper uses ~55.000 SFQ in one machine, this patch should free 165 MB of memory. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> CC: Jesper Dangaard Brouer <hawk@diku.dk> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Jamal Hadi Salim <hadi@cyberus.ca> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-19net_sched: cleanupsEric Dumazet1-10/+8
Cleanup net/sched code to current CodingStyle and practices. Reduce inline abuse Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-10net_sched: factorize qdisc stats handlingEric Dumazet1-2/+1
HTB takes into account skb is segmented in stats updates. Generalize this to all schedulers. They should use qdisc_bstats_update() helper instead of manipulating bstats.bytes and bstats.packets Add bstats_update() helper too for classes that use gnet_stats_basic_packed fields. Note : Right now, TCQ_F_CAN_BYPASS shortcurt can be taken only if no stab is setup on qdisc. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-31sfq: fix slot_dequeue_head()Eric Dumazet1-2/+4
slot_dequeue_head() should make sure slot skb chain is correct in both ways, or we can crash if all possible flows are in use. Jarek pointed out slot_queue_init() can now be done in sfq_init() once, instead each time a flow is setup. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-31sch_sfq: allow big packets and be fairEric Dumazet1-7/+19
SFQ is currently 'limited' to small packets, because it uses a 15bit allotment number per flow. Introduce a scale by 8, so that we can handle full size TSO/GRO packets. Use appropriate handling to make sure allot is positive before a new packet is dequeued, so that fairness is respected. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Jarek Poplawski <jarkao2@gmail.com> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-22sfq: fix sfq class stats handlingEric Dumazet1-5/+11
sfq_walk() runs without qdisc lock. By the time it selects a non empty hash slot and sfq_dump_class_stats() is run (with lock held), slot might have been freed : We then access q->slots[SFQ_EMPTY_SLOT], out of bounds, and crash in slot_queue_walk() On previous kernels, bug is here but out of bounds qs[SFQ_DEPTH] and allot[SFQ_DEPTH] are located in struct sfq_sched_data, so no illegal memory access happens, only possibly wrong data reported to user. Also, slot_dequeue_tail() should make sure slot skb chain is correctly terminated, or sfq_dump_class_stats() can access freed skbs. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-20net_sched: sch_sfq: better struct layoutsEric Dumazet1-98/+162
Here is a respin of patch. I'll send a short patch to make SFQ more fair in presence of large packets as well. Thanks [PATCH v3 net-next-2.6] net_sched: sch_sfq: better struct layouts This patch shrinks sizeof(struct sfq_sched_data) from 0x14f8 (or more if spinlocks are bigger) to 0x1180 bytes, and reduce text size as well. text data bss dec hex filename 4821 152 0 4973 136d old/net/sched/sch_sfq.o 4627 136 0 4763 129b new/net/sched/sch_sfq.o All data for a slot/flow is now grouped in a compact and cache friendly structure, instead of being spreaded in many different points. struct sfq_slot { struct sk_buff *skblist_next; struct sk_buff *skblist_prev; sfq_index qlen; /* number of skbs in skblist */ sfq_index next; /* next slot in sfq chain */ struct sfq_head dep; /* anchor in dep[] chains */ unsigned short hash; /* hash value (index in ht[]) */ short allot; /* credit for this slot */ }; Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jarek Poplawski <jarkao2@gmail.com> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-20Merge branch 'master' of ↵David S. Miller1-12/+8
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
2010-12-20net_sched: sch_sfq: fix allot handlingEric Dumazet1-12/+8
When deploying SFQ/IFB here at work, I found the allot management was pretty wrong in sfq, even changing allot from short to int... We should init allot for each new flow, not using a previous value found in slot. Before patch, I saw bursts of several packets per flow, apparently denying the default "quantum 1514" limit I had on my SFQ class. class sfq 11:1 parent 11: (dropped 0, overlimits 0 requeues 0) backlog 0b 7p requeues 0 allot 11546 class sfq 11:46 parent 11: (dropped 0, overlimits 0 requeues 0) backlog 0b 1p requeues 0 allot -23873 class sfq 11:78 parent 11: (dropped 0, overlimits 0 requeues 0) backlog 0b 5p requeues 0 allot 11393 After patch, better fairness among each flow, allot limit being respected, allot is positive : class sfq 11:e parent 11: (dropped 0, overlimits 0 requeues 86) backlog 0b 3p requeues 86 allot 596 class sfq 11:94 parent 11: (dropped 0, overlimits 0 requeues 0) backlog 0b 3p requeues 0 allot 1468 class sfq 11:a4 parent 11: (dropped 0, overlimits 0 requeues 0) backlog 0b 4p requeues 0 allot 650 class sfq 11:bb parent 11: (dropped 0, overlimits 0 requeues 0) backlog 0b 3p requeues 0 allot 596 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-20net_sched: sch_sfq: add backlog info in sfq_dump_class_stats()Eric Dumazet1-1/+6
We currently return for each active SFQ slot the number of packets in queue. We can also give number of bytes accounted for these packets. tc -s class show dev ifb0 Before patch : class sfq 11:3d9 parent 11: (dropped 0, overlimits 0 requeues 0) backlog 0b 3p requeues 0 allot 1266 After patch : class sfq 11:3e4 parent 11: (dropped 0, overlimits 0 requeues 0) backlog 4380b 3p requeues 0 allot 1212 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-19net_sched: sch_sfq: use proto_ports_offset() to support AH messageChangli Gao1-17/+16
Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-11pkt_sched: Fix sch_sfq vs tc_modify_qdisc oopsJarek Poplawski1-0/+6
sch_sfq as a classful qdisc needs the .leaf handler. Otherwise, there is an oops possible in tc_modify_qdisc()/check_loop(). Fixes commit 7d2681a6ff4f9ab5e48d02550b4c6338f1638998 Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-10net: Use NET_XMIT_SUCCESS where possible.Ben Greear1-1/+1
This is based on work originally done by Patric McHardy. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-10pkt_sched: sch_sfq: Add dummy unbind_tcf and put handles. Was: [PATCH] sfq: ↵Jarek Poplawski1-0/+6
add dummy bind/unbind handles Add dummy .unbind_tcf and .put qdisc class ops for easier verification. (All other schedulers have it like this.) Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-07pkt_sched: Fix sch_sfq vs tcf_bind_filter oopsJarek Poplawski1-0/+7
Since there was added ->tcf_chain() method without ->bind_tcf() to sch_sfq class options, there is oops when a filter is added with the classid parameter. Fixes commit 7d2681a6ff4f9ab5e48d02550b4c6338f1638998 netdev thread: null pointer at cls_api.c Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Reported-by: Franchoze Eric <franchoze@yandex.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-04sch_sfq: add sanity check for the packet lengthChangli Gao1-9/+20
The packet length should be checked before the packet data is dereferenced. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20net: Fix various endianness glitchesEric Dumazet1-5/+5
Sparse can help us find endianness bugs, but we need to make some cleanups to be able to more easily spot real bugs. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo1-0/+1
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2009-09-06net_sched: make cls_ops->change and cls_ops->delete optionalPatrick McHardy1-7/+0
Some schedulers don't support creating, changing or deleting classes. Make the respective callbacks optionally and consistently return -EOPNOTSUPP for unsupported operations, instead of currently either -EOPNOTSUPP, -ENOSYS or no error. In case of sch_prio and sch_multiq, the removed operations additionally checked for an invalid class. This is not necessary since the class argument can only orginate from ->get() or in case of ->change is 0 for creation of new classes, in which case ->change() incorrectly returned -ENOENT. As a side-effect, this patch fixes a possible (root-only) NULL pointer function call in sch_ingress, which didn't implement a so far mandatory ->delete() operation. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-03net: skb->dst accessorsEric Dumazet1-1/+1
Define three accessors to get/set dst attached to a skb struct dst_entry *skb_dst(const struct sk_buff *skb) void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst) void skb_dst_drop(struct sk_buff *skb) This one should replace occurrences of : dst_release(skb->dst) skb->dst = NULL; Delete skb->dst field Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-08remove lots of double-semicolonsFernando Carrijo1-1/+1
Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: Theodore Ts'o <tytso@mit.edu> Acked-by: Mark Fasheh <mfasheh@suse.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: James Morris <jmorris@namei.org> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-21pkt_sched: Annotate uninitialized var in sfq_enqueue()Jarek Poplawski1-1/+1
Some gcc versions warn that ret may be used uninitialized in sfq_enqueue(). It's a false positive, so let's annotate this. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-13pkt_sched: Remove qdisc->ops->requeue() etc.Jarek Poplawski1-63/+0
After implementing qdisc->ops->peek() and changing sch_netem into classless qdisc there are no more qdisc->ops->requeue() users. This patch removes this method with its wrappers (qdisc_requeue()), and also unused qdisc->requeue structure. There are a few minor fixes of warnings (htb_enqueue()) and comments btw. The idea to kill ->requeue() and a similar patch were first developed by David S. Miller. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31pkt_sched: Add ->peek() methods for fifo, prio and SFQ qdiscs.Patrick McHardy1-0/+12
From: Patrick McHardy <kaber@trash.net> Just as a demonstration how easy adding a peek operation to the work-conserving qdiscs actually is. It doesn't need to keep or change any internal state in many cases thanks to the guarantee that the packet will either be dequeued or, if another packet arrives, the upper qdisc will immediately ->peek again to reevaluate the state. (This is only slightly modified Patrick's patch.) Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-09-20net: Use hton[sl]() instead of __constant_hton[sl]() where applicableArnaldo Carvalho de Melo1-2/+2
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-08-04net_sched: Add qdisc __NET_XMIT_BYPASS flagJarek Poplawski1-3/+3
Patrick McHardy <kaber@trash.net> noticed that it would be nice to handle NET_XMIT_BYPASS by NET_XMIT_SUCCESS with an internal qdisc flag __NET_XMIT_BYPASS and to remove the mapping from dev_queue_xmit(). David Miller <davem@davemloft.net> spotted a serious bug in the first version of this patch. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-08-04net_sched: Add qdisc __NET_XMIT_STOLEN flagJarek Poplawski1-1/+1
Patrick McHardy <kaber@trash.net> noticed: "The other problem that affects all qdiscs supporting actions is TC_ACT_QUEUED/TC_ACT_STOLEN getting mapped to NET_XMIT_SUCCESS even though the packet is not queued, corrupting upper qdiscs' qlen counters." and later explained: "The reason why it translates it at all seems to be to not increase the drops counter. Within a single qdisc this could be avoided by other means easily, upper qdiscs would still increase the counter when we return anything besides NET_XMIT_SUCCESS though. This means we need a new NET_XMIT return value to indicate this to the upper qdiscs. So I'd suggest to introduce NET_XMIT_STOLEN, return that to upper qdiscs and translate it to NET_XMIT_SUCCESS in dev_queue_xmit, similar to NET_XMIT_BYPASS." David Miller <davem@davemloft.net> noticed: "Maybe these NET_XMIT_* values being passed around should be a set of bits. They could be composed of base meanings, combined with specific attributes. So you could say "NET_XMIT_DROP | __NET_XMIT_NO_DROP_COUNT" The attributes get masked out by the top-level ->enqueue() caller, such that the base meanings are the only thing that make their way up into the stack. If it's only about communication within the qdisc tree, let's simply code it that way." This patch is trying to realize these ideas. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-26Revert "pkt_sched: sch_sfq: dump a real number of flows"David S. Miller1-8/+1
This reverts commit f867e6af94239a04ec23aeec2fcda5aa58e41db7. Based upon discussions between Jarek and Patrick McHardy this is field being set is more a config parameter than a statistic. And we should add a true statistic to provide this information if we really want it. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-23pkt_sched: sch_sfq: dump a real number of flowsJarek Poplawski1-1/+8
Dump the "flows" number according to the number of active flows instead of repeating the "limit". Reported-by: Denys Fedoryshchenko <denys@visp.net.lb> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-20net_sched: Add accessor function for packet length for qdiscsJussi Kivilinna1-8/+8
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-08pkt_sched: Remove 'dev' member of struct Qdisc.David S. Miller1-2/+2
It can be obtained via the netdev_queue. So create a helper routine, qdisc_dev(), to make the transformations nicer looking. Now, qdisc_alloc() now no longer needs a net_device pointer argument. Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-01net-sched: change tcf_destroy_chain() to clear start of filter listPatrick McHardy1-1/+1
Pass double tcf_proto pointers to tcf_destroy_chain() to make it clear the start of the filter list for more consistency. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-29sch_sfq: use del_timer_sync() in sfq_destroy()Jarek Poplawski1-1/+2
Let's delete timer reliably in sfq_destroy(). Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[NET_SCHED]: sch_sfq: make internal queues visible as classesPatrick McHardy1-1/+40
Add support for dumping statistics and make internal queues visible as classes. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[NET_SCHED]: sch_sfq: add support for external classifiersPatrick McHardy1-4/+91
Add support for external classifiers to allow using different flow hash functions similar to ESFQ. When no classifier is attached the built-in hash is used as before. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET_SCHED]: Convert packet schedulers from rtnetlink to new netlink APIPatrick McHardy1-6/+6
Convert packet schedulers to use the netlink API. Unfortunately a gradual conversion is not possible without breaking compilation in the middle or adding lots of casts, so this patch converts them all in one step. The patch has been mostly generated automatically with some minor edits to at least allow seperate conversion of classifiers and actions. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[PKT_SCHED] SFQ: whitespace cleanupStephen Hemminger1-17/+25
Add whitespace around operators, and add a few blank lines to improve readability. Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[PKT_SCHED] SFQ: use net_randomStephen Hemminger1-3/+3
SFQ doesn't need true random numbers, it is only using them to salt a hash. Therefore it is better to use net_random() and avoid any possible problems with depleting the entropy pool. Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[PKT_SCHED] SFQ: timer is deferrableStephen Hemminger1-1/+3
The perturbation timer used for re-keying can be deferred, it doesn't need to be deterministic. Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET]: Move Qdisc_class_ops and Qdisc_ops in appropriate sections.Eric Dumazet1-1/+1
Qdisc_class_ops are const, and Qdisc_ops are mostly read. Using "const" and "__read_mostly" qualifiers helps to reduce false sharing. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET]: Convert init_timer into setup_timerPavel Emelyanov1-3/+1
Many-many code in the kernel initialized the timer->function and timer->data together with calling init_timer(timer). There is already a helper for this. Use it for networking code. The patch is HUGE, but makes the code 130 lines shorter (98 insertions(+), 228 deletions(-)). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-01[SFQ]: Remove artificial limitation for queue limit.Alexey Kuznetsov1-16/+31
This is followup to Patrick's patch. A little optimization to enqueue routine allows to remove artificial limitation on queue length. Plus, testing showed that hash function used by SFQ is too bad or even worse. It does not even sweep the whole range of hash values. Switched to Jenkins' hash. Signed-off-by: Alexey Kuznetsov <kaber@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-09-20[PKT_SCHED]: Fix 'SFQ qdisc crashes with limit of 2 packets'Alexey Kuznetsov1-5/+5
Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10[NET_SCHED]: Remove unnecessary includesPatrick McHardy1-16/+2
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[NETLINK]: Use nlmsg_trim() where appropriateArnaldo Carvalho de Melo1-1/+2
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[SK_BUFF]: Convert skb->tail to sk_buff_data_tArnaldo Carvalho de Melo1-1/+1
So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes on 64bit architectures, allowing us to combine the 4 bytes hole left by the layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4 64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN... :-) Many calculations that previously required that skb->{transport,network, mac}_header be first converted to a pointer now can be done directly, being meaningful as offsets or pointers. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[SK_BUFF]: Introduce ipv6_hdr(), remove skb->nh.ipv6hArnaldo Carvalho de Melo1-1/+1
Now the skb->nh union has just one member, .raw, i.e. it is just like the skb->mac union, strange, no? I'm just leaving it like that till the transport layer is done with, when we'll rename skb->mac.raw to skb->mac_header (or ->mac_header_offset?), ditto for ->{h,nh}. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[SK_BUFF]: Introduce ip_hdr(), remove skb->nh.iphArnaldo Carvalho de Melo1-1/+1
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-10[NET] SCHED: Fix whitespace errors.YOSHIFUJI Hideaki1-3/+3
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-08[NET]: Add UDPLITE support in a few missing spotsPatrick McHardy1-0/+2
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02[NET_SCHED]: Fix endless loops (part 2): "simple" qdiscsPatrick McHardy1-0/+3
Convert the "simple" qdiscs to use qdisc_tree_decrease_qlen() where necessary: - all graft operations - destruction of old child qdiscs in prio, red and tbf change operation - purging of queue in sfq change operation Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-30Remove obsolete #include <linux/config.h>Jörn Engel1-1/+0
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-20[PKT_SCHED]: Keep backlog counter in sch_sfqPatrick McHardy1-0/+5
Keep backlog counter in SFQ qdisc to make it usable as child qdisc with RED. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-17[PKT_SCHED]: Handle SCTP/DCCP in sfq_hashPatrick McHardy1-0/+4
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-16Linux-2.6.12-rc2v2.6.12-rc2Linus Torvalds1-0/+497
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!