aboutsummaryrefslogtreecommitdiffstats
path: root/arch/powerpc/mm/numa.c
AgeCommit message (Collapse)AuthorFilesLines
2024-03-03powerpc: Stop using of_rootChristophe Leroy1-2/+4
Replace all usages of of_root by of_find_node_by_path("/") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231214103152.12269-5-mpe@ellerman.id.au
2023-08-24powerpc/pseries: Move VPHN constants into vphn.hMichael Ellerman1-0/+1
These don't have any particularly good reason to belong in lppaca.h, move them into their own header. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230823055317.751786-1-mpe@ellerman.id.au
2023-04-28Merge tag 'powerpc-6.4-1' of ↵Linus Torvalds1-14/+7
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Add support for building the kernel using PC-relative addressing on Power10. - Allow HV KVM guests on Power10 to use prefixed instructions. - Unify support for the P2020 CPU (85xx) into a single machine description. - Always build the 64-bit kernel with 128-bit long double. - Drop support for several obsolete 2000's era development boards as identified by Paul Gortmaker. - A series fixing VFIO on Power since some generic changes. - Various other small features and fixes. Thanks to Alexey Kardashevskiy, Andrew Donnellan, Benjamin Gray, Bo Liu, Christophe Leroy, Dan Carpenter, David Binderman, Ira Weiny, Joel Stanley, Kajol Jain, Kautuk Consul, Liang He, Luis Chamberlain, Masahiro Yamada, Michael Neuling, Nathan Chancellor, Nathan Lynch, Nicholas Miehlbradt, Nicholas Piggin, Nick Desaulniers, Nysal Jan K.A, Pali Rohár, Paul Gortmaker, Paul Mackerras, Petr Vaněk, Randy Dunlap, Rob Herring, Sachin Sant, Sean Christopherson, Segher Boessenkool, and Timothy Pearson. * tag 'powerpc-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (156 commits) powerpc/64s: Disable pcrel code model on Clang powerpc: Fix merge conflict between pcrel and copy_thread changes powerpc/configs/powernv: Add IGB=y powerpc/configs/64s: Drop JFS Filesystem powerpc/configs/64s: Use EXT4 to mount EXT2 filesystems powerpc/configs: Make pseries_defconfig an alias for ppc64le_guest powerpc/configs: Make pseries_le an alias for ppc64le_guest powerpc/configs: Incorporate generic kvm_guest.config into guest configs powerpc/configs: Add IBMVETH=y and IBMVNIC=y to guest configs powerpc/configs/64s: Enable Device Mapper options powerpc/configs/64s: Enable PSTORE powerpc/configs/64s: Enable VLAN support powerpc/configs/64s: Enable BLK_DEV_NVME powerpc/configs/64s: Drop REISERFS powerpc/configs/64s: Use SHA512 for module signatures powerpc/configs/64s: Enable IO_STRICT_DEVMEM powerpc/configs/64s: Enable SCHEDSTATS powerpc/configs/64s: Enable DEBUG_VM & other options powerpc/configs/64s: Enable EMULATED_STATS powerpc/configs/64s: Enable KUNIT and most tests ...
2023-04-04powerpc: Use of_address_to_resource()Rob Herring1-14/+7
Replace open coded reading of "reg" or of_get_address()/ of_translate_address() calls with a single call to of_address_to_resource(). Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230329220337.141295-1-robh@kernel.org
2023-04-04powerpc/papr_scm: Update the NUMA distance table for the target nodeAneesh Kumar K.V1-0/+1
Platform device helper routines won't update the NUMA distance table while creating a platform device, even if the device is present on a NUMA node that doesn't have memory or CPU. This is especially true for pmem devices. If the target node of the pmem device is not online, we find the nearest online node to the device and associate the pmem device with that online node. To find the nearest online node, we should have the numa distance table updated correctly. Update the distance information during the device probe. For a papr scm device on NUMA node 3 distance_lookup_table value for distance_ref_points_depth = 2 before and after fix is below: Before fix: node 3 distance depth 0 - 0 node 3 distance depth 1 - 0 node 4 distance depth 0 - 4 node 4 distance depth 1 - 2 node 5 distance depth 0 - 5 node 5 distance depth 1 - 1 After fix node 3 distance depth 0 - 3 node 3 distance depth 1 - 1 node 4 distance depth 0 - 4 node 4 distance depth 1 - 2 node 5 distance depth 0 - 5 node 5 distance depth 1 - 1 Without the fix, the nearest numa node to the pmem device (NUMA node 3) will be picked as 4. After the fix, we get the correct numa node which is 5. Fixes: da1115fdbd6e ("powerpc/nvdimm: Pick nearby online node if the device node is not online") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230404041433.1781804-1-aneesh.kumar@linux.ibm.com
2022-09-30powerpc/mm: Always update max/min_low_pfn in mem_topology_setup()Aneesh Kumar K.V1-3/+3
For both CONFIG_NUMA enabled/disabled use mem_topology_setup() to update max/min_low_pfn. This also adds min_low_pfn update to CONFIG_NUMA which was initialized to zero before. (mpe: Though MEMORY_START is == 0 for PPC64=y which is all possible NUMA=y systems) Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220704063851.295482-1-aneesh.kumar@linux.ibm.com
2022-05-22powerpc/numa: Associate numa node to its cpu earlierOscar Salvador1-24/+8
powerpc is the only platform that do not rely on cpu_up()->try_online_node() to bring up a numa node, and special cases it, instead, deep in its own machinery: dlpar_online_cpu find_and_online_cpu_nid try_online_node This should not be needed, but the thing is that the try_online_node() from cpu_up() will not apply on the right node, because cpu_to_node() will return the old mapping numa<->cpu that gets set on boot stage for all possible cpus. That can be seen easily if we try to print out the numa node passed to try_online_node() in cpu_up(). The thing is that the numa<->cpu mapping does not get updated till a much later stage in start_secondary: start_secondary: set_numa_node(numa_cpu_lookup_table[cpu]) But we do not really care, as we already now the CPU <-> NUMA associativity back in find_and_online_cpu_nid(), so let us make use of that and set the proper numa<->cpu mapping, so cpu_to_node() in cpu_up() returns the right node and try_online_node() can do its work. Signed-off-by: Oscar Salvador <osalvador@suse.de> Tested-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220411074934.4632-1-osalvador@suse.de
2022-05-08powerpc: Remove asm/prom.h from all files that don't need itChristophe Leroy1-1/+0
Several files include asm/prom.h for no reason. Clean it up. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> [mpe: Drop change to prom_parse.c as reported by lkp@intel.com] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/7c9b8fda63dcf63e1b28f43e7ebdb95182cbc286.1646767214.git.christophe.leroy@csgroup.eu
2022-05-04powerpc/mm: Switch from __FUNCTION__ to __func__Dwaipayan Ray1-2/+1
__FUNCTION__ exists only for backwards compatibility reasons with old gcc versions. Replace it with __func__. Signed-off-by: Dwaipayan Ray <dwaipayanray1@gmail.com> [chleroy: Fixed parenthesis alignment] Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210711084536.95394-1-dwaipayanray1@gmail.com
2022-03-31powerpc/numa: Handle partially initialized numa nodesSrikar Dronamraju1-1/+1
With commit 09f49dca570a ("mm: handle uninitialized numa nodes gracefully") NODE_DATA for even a memoryless/cpuless node is partially initialized at boot time. Before onlining the node, current Powerpc code checks for NODE_DATA to be NULL. However since NODE_DATA is partially initialized, this check will end up always being false. This causes hotplugging a CPU to a memoryless/cpuless node to fail. Before adding CPUs: $ numactl -H available: 1 nodes (4) node 4 cpus: 0 1 2 3 4 5 6 7 node 4 size: 97372 MB node 4 free: 95545 MB node distances: node 4 4: 10 $ lparstat System Configuration type=Dedicated mode=Capped smt=8 lcpu=1 mem=99709440 kB cpus=0 ent=1.00 %user %sys %wait %idle physc %entc lbusy app vcsw phint ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 2.66 2.67 0.16 94.51 0.00 0.00 5.33 0.00 67749 0 After hotplugging 32 cores: $ numactl -H node 4 cpus: 0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 node 4 size: 97372 MB node 4 free: 93636 MB node distances: node 4 4: 10 $ lparstat System Configuration type=Dedicated mode=Capped smt=8 lcpu=33 mem=99709440 kB cpus=0 ent=33.00 %user %sys %wait %idle physc %entc lbusy app vcsw phint ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 0.04 0.02 0.00 99.94 0.00 0.00 0.06 0.00 1128751 3 As we can see numactl is listing only 8 cores while lparstat is showing 33 cores. Also dmesg is showing messages like: [ 2261.318350 ] BUG: arch topology borken [ 2261.318357 ] the DIE domain not a subset of the NODE domain Fixes: 09f49dca570a ("mm: handle uninitialized numa nodes gracefully") Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220330135123.1868197-1-srikar@linux.vnet.ibm.com
2022-03-01powerpc/mm/numa: skip NUMA_NO_NODE onlining in parse_numa_properties()Daniel Henrique Barboza1-1/+3
Executing node_set_online() when nid = NUMA_NO_NODE results in an undefined behavior. node_set_online() will call node_set_state(), into __node_set(), into set_bit(), and since NUMA_NO_NODE is -1 we'll end up doing a negative shift operation inside arch/powerpc/include/asm/bitops.h. This potential UB was detected running a kernel with CONFIG_UBSAN. The behavior was introduced by commit 10f78fd0dabb ("powerpc/numa: Fix a regression on memoryless node 0"), where the check for nid > 0 was removed to fix a problem that was happening with nid = 0, but the result is that now we're trying to online NUMA_NO_NODE nids as well. Checking for nid >= 0 will allow node 0 to be onlined while avoiding this UB with NUMA_NO_NODE. Fixes: 10f78fd0dabb ("powerpc/numa: Fix a regression on memoryless node 0") Reported-by: Ping Fang <pifang@redhat.com> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220224182312.1012527-1-danielhb413@gmail.com
2021-12-23powerpc/mm: Add __init attribute to eligible functionsNick Child1-3/+3
Some functions defined in 'arch/powerpc/mm' are deserving of an `__init` macro attribute. These functions are only called by other initialization functions and therefore should inherit the attribute. Also, change function declarations in header files to include `__init`. Signed-off-by: Nick Child <nick.child@ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211216220035.605465-4-nick.child@ibm.com
2021-11-15powerpc/pseries: Fix numa FORM2 parsing fallback codeNicholas Piggin1-16/+12
In case the FORM2 distance table from firmware is not the expected size, there is fallback code that just populates the lookup table as local vs remote. However it then continues on to use the distance table. Fix. Fixes: 1c6b5a7e7405 ("powerpc/pseries: Add support for FORM2 associativity") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211109064900.2041386-2-npiggin@gmail.com
2021-11-15powerpc/pseries: rename numa_dist_table to form2_distancesNicholas Piggin1-9/+9
The name of the local variable holding the "form2" property address conflicts with the numa_distance_table global. This patch does 's/numa_dist_table/form2_distances/g' over the function, which also renames numa_dist_table_length to form2_distances_length. Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211109064900.2041386-1-npiggin@gmail.com
2021-08-27powerpc/numa: Update cpu_cpu_map on CPU online/offlineSrikar Dronamraju1-5/+2
cpu_cpu_map holds all the CPUs in the DIE. However in PowerPC, when onlining/offlining of CPUs, this mask doesn't get updated. This mask is however updated when CPUs are added/removed. So when both operations like online/offline of CPUs and adding/removing of CPUs are done simultaneously, then cpumaps end up broken. WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898 build_sched_domains+0xd48/0x1720 Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28 Workqueue: events cpuset_hotplug_workfn NIP: c0000000001caac8 LR: c0000000001caac4 CTR: 00000000007088ec REGS: c00000005596f220 TRAP: 0700 Not tainted (5.13.0-rc6+) MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 48828222 XER: 00000009 CFAR: c0000000001ea698 IRQMASK: 0 GPR00: c0000000001caac4 c00000005596f4c0 c000000001c4a400 0000000000000036 GPR04: 00000000fffdffff c00000005596f1d0 0000000000000027 c0000018cfd07f90 GPR08: 0000000000000023 0000000000000001 0000000000000027 c0000018fe68ffe8 GPR12: 0000000000008000 c00000001e9d1880 c00000013a047200 0000000000000800 GPR16: c000000001d3c7d0 0000000000000240 0000000000000048 c000000010aacd18 GPR20: 0000000000000001 c000000010aacc18 c00000013a047c00 c000000139ec2400 GPR24: 0000000000000280 c000000139ec2520 c000000136c1b400 c000000001c93060 GPR28: c00000013a047c20 c000000001d3c6c0 c000000001c978a0 000000000000000d NIP [c0000000001caac8] build_sched_domains+0xd48/0x1720 LR [c0000000001caac4] build_sched_domains+0xd44/0x1720 Call Trace: [c00000005596f4c0] [c0000000001caac4] build_sched_domains+0xd44/0x1720 (unreliable) [c00000005596f670] [c0000000001cc5ec] partition_sched_domains_locked+0x3ac/0x4b0 [c00000005596f710] [c0000000002804e4] rebuild_sched_domains_locked+0x404/0x9e0 [c00000005596f810] [c000000000283e60] rebuild_sched_domains+0x40/0x70 [c00000005596f840] [c000000000284124] cpuset_hotplug_workfn+0x294/0xf10 [c00000005596fc60] [c000000000175040] process_one_work+0x290/0x590 [c00000005596fd00] [c0000000001753c8] worker_thread+0x88/0x620 [c00000005596fda0] [c000000000181704] kthread+0x194/0x1a0 [c00000005596fe10] [c00000000000ccec] ret_from_kernel_thread+0x5c/0x70 Instruction dump: 485af049 60000000 2fa30800 409e0028 80fe0000 e89a00f8 e86100e8 38da0120 7f88e378 7ce53b78 4801fb91 60000000 <0fe00000> 39000000 38e00000 38c00000 Fix this by updating cpu_cpu_map aka cpumask_of_node() on every CPU online/offline. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210826100521.412639-5-srikar@linux.vnet.ibm.com
2021-08-27powerpc/numa: Print debug statements only when requiredSrikar Dronamraju1-5/+4
Currently, a debug message gets printed every time an attempt to add(remove) a CPU. However this is redundant if the CPU is already added (removed) from the node. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210826100521.412639-4-srikar@linux.vnet.ibm.com
2021-08-27powerpc/numa: convert printk to pr_xxxSrikar Dronamraju1-13/+9
Convert the remaining printk to pr_xxx One advantage would be all prints will now have prefix "numa:" from pr_fmt(). [ convert printk(KERN_ERR) to pr_warn : Suggested by Laurent Dufour ] Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [mpe: Rebase onto powerpc/next, s/WARNING/Warning/] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210826100521.412639-3-srikar@linux.vnet.ibm.com
2021-08-27powerpc/numa: Drop dbg in favour of pr_debugSrikar Dronamraju1-15/+9
powerpc supported numa=debug which is not documented. This option was used to print early debug output. However something more flexible can be achieved by using CONFIG_DYNAMIC_DEBUG. Hence drop dbg (and numa=debug) in favour of pr_debug Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [mpe: Rebase on to powerpc/next form2 affinity changes] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210826100521.412639-2-srikar@linux.vnet.ibm.com
2021-08-13powerpc/pseries: Add support for FORM2 associativityAneesh Kumar K.V1-35/+152
PAPR interface currently supports two different ways of communicating resource grouping details to the OS. These are referred to as Form 0 and Form 1 associativity grouping. Form 0 is the older format and is now considered deprecated. This patch adds another resource grouping named FORM2. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210812132223.225214-6-aneesh.kumar@linux.ibm.com
2021-08-13powerpc/pseries: Add a helper for form1 cpu distanceAneesh Kumar K.V1-1/+9
This helper is only used with the dispatch trace log collection. A later patch will add Form2 affinity support and this change helps in keeping that simpler. Also add a comment explaining we don't expect the code to be called with FORM0 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210812132223.225214-5-aneesh.kumar@linux.ibm.com
2021-08-13powerpc/pseries: Consolidate different NUMA distance update code pathsAneesh Kumar K.V1-57/+155
The associativity details of the newly added resourced are collected from the hypervisor via "ibm,configure-connector" rtas call. Update the numa distance details of the newly added numa node after the above call. Instead of updating NUMA distance every time we lookup a node id from the associativity property, add helpers that can be used during boot which does this only once. Also remove the distance update from node id lookup helpers. Currently, we duplicate parsing code for ibm,associativity and ibm,associativity-lookup-arrays in the kernel. The associativity array provided by these device tree properties are very similar and hence can use a helper to parse the node id and numa distance details. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210812132223.225214-4-aneesh.kumar@linux.ibm.com
2021-08-13powerpc/pseries: Rename TYPE1_AFFINITY to FORM1_AFFINITYAneesh Kumar K.V1-14/+21
Also make related code cleanup that will allow adding FORM2_AFFINITY in later patches. No functional change in this patch. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210812132223.225214-3-aneesh.kumar@linux.ibm.com
2021-08-13powerpc/pseries: rename min_common_depth to primary_domain_indexAneesh Kumar K.V1-19/+19
No functional change in this patch. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210812132223.225214-2-aneesh.kumar@linux.ibm.com
2021-08-10powerpc/numa: Consider the max NUMA node for migratable LPARLaurent Dufour1-3/+10
When a LPAR is migratable, we should consider the maximum possible NUMA node instead of the number of NUMA nodes from the actual system. The DT property 'ibm,current-associativity-domains' defines the maximum number of nodes the LPAR can see when running on that box. But if the LPAR is being migrated on another box, it may see up to the nodes defined by 'ibm,max-associativity-domains'. So if a LPAR is migratable, that value should be used. Unfortunately, there is no easy way to know if an LPAR is migratable or not. The hypervisor exports the property 'ibm,migratable-partition' in the case it set to migrate partition, but that would not mean that the current partition is migratable. Without this patch, when a LPAR is started on a 2 node box and then migrated to a 3 node box, the hypervisor may spread the LPAR's CPUs on the 3rd node. In that case if a CPU from that 3rd node is added to the LPAR, it will be wrongly assigned to the node because the kernel has been set to use up to 2 nodes (the configuration of the departure node). With this patch applies, the CPU is correctly added to the 3rd node. Fixes: f9f130ff2ec9 ("powerpc/numa: Detect support for coregroup") Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210511073136.17795-1-ldufour@linux.ibm.com
2020-11-27powerpc/numa: Fix a regression on memoryless node 0Srikar Dronamraju1-2/+1
Commit e75130f20b1f ("powerpc/numa: Offline memoryless cpuless node 0") offlines node 0 and expects nodes to be subsequently onlined when CPUs or nodes are detected. Commit 6398eaa26816 ("powerpc/numa: Prefer node id queried from vphn") skips onlining node 0 when CPUs are associated with node 0. On systems with node 0 having CPUs but no memory, this causes node 0 be marked offline. This causes issues at boot time when trying to set memory node for online CPUs while building the zonelist. 0:mon> t [link register ] c000000000400354 __build_all_zonelists+0x164/0x280 [c00000000161bda0] c0000000016533c8 node_states+0x20/0xa0 (unreliable) [c00000000161bdc0] c000000000400384 __build_all_zonelists+0x194/0x280 [c00000000161be30] c000000001041800 build_all_zonelists_init+0x4c/0x118 [c00000000161be80] c0000000004020d0 build_all_zonelists+0x190/0x1b0 [c00000000161bef0] c000000001003cf8 start_kernel+0x18c/0x6a8 [c00000000161bf90] c00000000000adb4 start_here_common+0x1c/0x3e8 0:mon> r R00 = c000000000400354 R16 = 000000000b57a0e8 R01 = c00000000161bda0 R17 = 000000000b57a6b0 R02 = c00000000161ce00 R18 = 000000000b5afee8 R03 = 0000000000000000 R19 = 000000000b6448a0 R04 = 0000000000000000 R20 = fffffffffffffffd R05 = 0000000000000000 R21 = 0000000001400000 R06 = 0000000000000000 R22 = 000000001ec00000 R07 = 0000000000000001 R23 = c000000001175580 R08 = 0000000000000000 R24 = c000000001651ed8 R09 = c0000000017e84d8 R25 = c000000001652480 R10 = 0000000000000000 R26 = c000000001175584 R11 = c000000c7fac0d10 R27 = c0000000019568d0 R12 = c000000000400180 R28 = 0000000000000000 R13 = c000000002200000 R29 = c00000000164dd78 R14 = 000000000b579f78 R30 = 0000000000000000 R15 = 000000000b57a2b8 R31 = c000000001175584 pc = c000000000400194 local_memory_node+0x24/0x80 cfar= c000000000074334 mcount+0xc/0x10 lr = c000000000400354 __build_all_zonelists+0x164/0x280 msr = 8000000002001033 cr = 44002284 ctr = c000000000400180 xer = 0000000000000001 trap = 380 dar = 0000000000001388 dsisr = c00000000161bc90 0:mon> Fix this by setting node to be online while onlining CPUs that belong to node 0. Fixes: e75130f20b1f ("powerpc/numa: Offline memoryless cpuless node 0") Fixes: 6398eaa26816 ("powerpc/numa: Prefer node id queried from vphn") Reported-by: Milan Mohanty <milmohan@in.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201127053738.10085-1-srikar@linux.vnet.ibm.com
2020-10-16Merge tag 'powerpc-5.10-1' of ↵Linus Torvalds1-17/+84
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting it for powerpc, as well as a related fix for sparc. - Remove support for PowerPC 601. - Some fixes for watchpoints & addition of a new ptrace flag for detecting ISA v3.1 (Power10) watchpoint features. - A fix for kernels using 4K pages and the hash MMU on bare metal Power9 systems with > 16TB of RAM, or RAM on the 2nd node. - A basic idle driver for shallow stop states on Power10. - Tweaks to our sched domains code to better inform the scheduler about the hardware topology on Power9/10, where two SMT4 cores can be presented by firmware as an SMT8 core. - A series doing further reworks & cleanups of our EEH code. - Addition of a filter for RTAS (firmware) calls done via sys_rtas(), to prevent root from overwriting kernel memory. - Other smaller features, fixes & cleanups. Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero, Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha, Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt, Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang Yingliang, zhengbin. * tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits) Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed" selftests/powerpc: Fix eeh-basic.sh exit codes cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier powerpc/time: Make get_tb() common to PPC32 and PPC64 powerpc/time: Make get_tbl() common to PPC32 and PPC64 powerpc/time: Remove get_tbu() powerpc/time: Avoid using get_tbl() and get_tbu() internally powerpc/time: Make mftb() common to PPC32 and PPC64 powerpc/time: Rename mftbl() to mftb() powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S powerpc/32s: Rename head_32.S to head_book3s_32.S powerpc/32s: Setup the early hash table at all time. powerpc/time: Remove ifdef in get_dec() and set_dec() powerpc: Remove get_tb_or_rtc() powerpc: Remove __USE_RTC() powerpc: Tidy up a bit after removal of PowerPC 601. powerpc: Remove support for PowerPC 601 powerpc: Remove PowerPC 601 powerpc: Drop SYNC_601() ISYNC_601() and SYNC() powerpc: Remove CONFIG_PPC601_SYNC_FIX ...
2020-10-13arch, mm: replace for_each_memblock() with for_each_mem_pfn_range()Mike Rapoport1-5/+2
There are several occurrences of the following pattern: for_each_memblock(memory, reg) { start_pfn = memblock_region_memory_base_pfn(reg); end_pfn = memblock_region_memory_end_pfn(reg); /* do something with start_pfn and end_pfn */ } Rather than iterate over all memblock.memory regions and each time query for their start and end PFNs, use for_each_mem_pfn_range() iterator to get simpler and clearer code. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> [.clang-format] Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Axtens <dja@axtens.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Emil Renner Berthing <kernel@esmil.dk> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20200818151634.14343-12-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-06pseries/hotplug-memory: hot-add: skip redundant LMB lookupScott Cheloha1-1/+1
During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid() to determine which node id (nid) to use when later calling __add_memory(). This is wasteful. On pseries, memory_add_physaddr_to_nid() finds an appropriate nid for a given address by looking up the LMB containing the address and then passing that LMB to of_drconf_to_nid_single() to get the nid. In dlpar_add_lmb() we get this address from the LMB itself. In short, we have a pointer to an LMB and then we are searching for that LMB *again* in order to find its nid. If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we can skip the redundant lookup. The only error handling we need to duplicate from memory_add_physaddr_to_nid() is the fallback to the default nid when drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or an invalid nid. Skipping the extra lookup makes hot-add operations faster, especially on machines with many LMBs. Consider an LPAR with 126976 LMBs. In one test, hot-adding 126000 LMBs on an upatched kernel took ~3.5 hours while a patched kernel completed the same operation in ~2 hours: Unpatched (12450 seconds): Sep 9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000 Sep 9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s) [...] Sep 9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added Patched (7065 seconds): Sep 8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000 Sep 8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s) [...] Sep 8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added It should be noted that the speedup grows more substantial when hot-adding LMBs at the end of the drconf range. This is because we are skipping a linear LMB search. To see the distinction, consider smaller hot-add test on the same LPAR. A perf-stat run with 10 iterations showed that hot-adding 4096 LMBs completed less than 1 second faster on a patched kernel: Unpatched: Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs): 104,753.42 msec task-clock # 0.992 CPUs utilized ( +- 0.55% ) 4,708 context-switches # 0.045 K/sec ( +- 0.69% ) 2,444 cpu-migrations # 0.023 K/sec ( +- 1.25% ) 394 page-faults # 0.004 K/sec ( +- 0.22% ) 445,902,503,057 cycles # 4.257 GHz ( +- 0.55% ) (66.67%) 8,558,376,740 stalled-cycles-frontend # 1.92% frontend cycles idle ( +- 0.88% ) (49.99%) 300,346,181,651 stalled-cycles-backend # 67.36% backend cycles idle ( +- 0.76% ) (50.01%) 258,091,488,691 instructions # 0.58 insn per cycle # 1.16 stalled cycles per insn ( +- 0.22% ) (66.67%) 70,568,169,256 branches # 673.660 M/sec ( +- 0.17% ) (50.01%) 3,100,725,426 branch-misses # 4.39% of all branches ( +- 0.20% ) (49.99%) 105.583 +- 0.589 seconds time elapsed ( +- 0.56% ) Patched: Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs): 104,055.69 msec task-clock # 0.993 CPUs utilized ( +- 0.32% ) 4,606 context-switches # 0.044 K/sec ( +- 0.20% ) 2,463 cpu-migrations # 0.024 K/sec ( +- 0.93% ) 394 page-faults # 0.004 K/sec ( +- 0.25% ) 442,951,129,921 cycles # 4.257 GHz ( +- 0.32% ) (66.66%) 8,710,413,329 stalled-cycles-frontend # 1.97% frontend cycles idle ( +- 0.47% ) (50.06%) 299,656,905,836 stalled-cycles-backend # 67.65% backend cycles idle ( +- 0.39% ) (50.02%) 252,731,168,193 instructions # 0.57 insn per cycle # 1.19 stalled cycles per insn ( +- 0.20% ) (66.66%) 68,902,851,121 branches # 662.173 M/sec ( +- 0.13% ) (49.94%) 3,100,242,882 branch-misses # 4.50% of all branches ( +- 0.15% ) (49.98%) 104.829 +- 0.325 seconds time elapsed ( +- 0.31% ) This is consistent. An add-by-count hot-add operation adds LMBs greedily, so LMBs near the start of the drconf range are considered first. On an otherwise idle LPAR with so many LMBs we would expect to find the LMBs we need near the start of the drconf range, hence the smaller speedup. Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200916145122.3408129-1-cheloha@linux.ibm.com
2020-09-16powerpc/smp: Implement cpu_to_coregroup_idSrikar Dronamraju1-0/+20
Lookup the coregroup id from the associativity array. If unable to detect the coregroup id, fallback on the core id. This way, ensure sched_domain degenerates and an extra sched domain is not created. Ideally this function should have been implemented in arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we don't need to find the primary domain again. If the device-tree mentions more than one coregroup, then kernel implements only the last or the smallest coregroup, which currently corresponds to the penultimate domain in the device-tree. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200810071834.92514-11-srikar@linux.vnet.ibm.com
2020-09-16powerpc/smp: Create coregroup domainSrikar Dronamraju1-0/+5
Add percpu coregroup maps and masks to create coregroup domain. If a coregroup doesn't exist, the coregroup domain will be degenerated in favour of SMT/CACHE domain. Do note this patch is only creating stubs for cpu_to_coregroup_id. The actual cpu_to_coregroup_id implementation would be in a subsequent patch. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200810071834.92514-10-srikar@linux.vnet.ibm.com
2020-09-16powerpc/numa: Detect support for coregroupSrikar Dronamraju1-13/+21
Add support for grouping cores based on the device-tree classification. - The last domain in the associativity domains always refers to the core. - If primary reference domain happens to be the penultimate domain in the associativity domains device-tree property, then there are no coregroups. However if its not a penultimate domain, then there are coregroups. There can be more than one coregroup. For now we would be interested in the last or the smallest coregroups, i.e one sub-group per DIE. Currently there are no firmwares that are exposing this grouping. Hence allow the basis for grouping to be abstract. Once the firmware starts using this grouping, code would be added to detect the type of grouping and adjust the sd domain flags accordingly. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com
2020-09-16powerpc/numa: Offline memoryless cpuless node 0Srikar Dronamraju1-0/+10
Currently Linux kernel with CONFIG_NUMA on a system with multiple possible nodes, marks node 0 as online at boot. However in practice, there are systems which have node 0 as memoryless and cpuless. This can cause numa_balancing to be enabled on systems with only one node with memory and CPUs. The existence of this dummy node which is cpuless and memoryless node can confuse users/scripts looking at output of lscpu / numactl. By marking, node 0 as offline, lets stop assuming that node 0 is always online. If node 0 has CPU or memory that are online, node 0 will again be set as online. v5.8 available: 2 nodes (0,2) node 0 cpus: node 0 size: 0 MB node 0 free: 0 MB node 2 cpus: 0 1 2 3 4 5 6 7 node 2 size: 32625 MB node 2 free: 31490 MB node distances: node 0 2 0: 10 20 2: 20 10 proc and sys files ------------------ /sys/devices/system/node/online: 0,2 /proc/sys/kernel/numa_balancing: 1 /sys/devices/system/node/has_cpu: 2 /sys/devices/system/node/has_memory: 2 /sys/devices/system/node/has_normal_memory: 2 /sys/devices/system/node/possible: 0-31 v5.8 + patch ------------------ available: 1 nodes (2) node 2 cpus: 0 1 2 3 4 5 6 7 node 2 size: 32625 MB node 2 free: 31487 MB node distances: node 2 2: 10 proc and sys files ------------------ /sys/devices/system/node/online: 2 /proc/sys/kernel/numa_balancing: 0 /sys/devices/system/node/has_cpu: 2 /sys/devices/system/node/has_memory: 2 /sys/devices/system/node/has_normal_memory: 2 /sys/devices/system/node/possible: 0-31 Example of a node with online CPUs/memory on node 0. (Same o/p with and without patch) numactl -H available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 node 0 size: 32482 MB node 0 free: 22994 MB node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 1 size: 0 MB node 1 free: 0 MB node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 node 2 size: 0 MB node 2 free: 0 MB node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 node 3 size: 0 MB node 3 free: 0 MB node distances: node 0 1 2 3 0: 10 20 40 40 1: 20 10 40 40 2: 40 40 10 20 3: 40 40 20 10 Note: On Powerpc, cpu_to_node of possible but not present cpus would previously return 0. Hence this commit depends on commit ("powerpc/numa: Set numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id queried from vphn"). Without the 2 commits, Powerpc system might crash. 1. User space applications like Numactl, lscpu, that parse the sysfs tend to believe there is an extra online node. This tends to confuse users and applications. Other user space applications start believing that system was not able to use all the resources (i.e missing resources) or the system was not setup correctly. 2. Also existence of dummy node also leads to inconsistent information. The number of online nodes is inconsistent with the information in the device-tree and resource-dump 3. When the dummy node is present, single node non-Numa systems end up showing up as NUMA systems and numa_balancing gets enabled. This will mean we take the hit from the unnecessary numa hinting faults. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com
2020-09-16powerpc/numa: Prefer node id queried from vphnSrikar Dronamraju1-9/+10
Node id queried from the static device tree may not be correct. For example: it may always show 0 on a shared processor. Hence prefer the node id queried from vphn and fallback on the device tree based node id if vphn query fails. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com
2020-09-16powerpc/numa: Set numa_node for all possible cpusSrikar Dronamraju1-1/+15
A Powerpc system with multiple possible nodes and with CONFIG_NUMA enabled always used to have a node 0, even if node 0 does not any cpus or memory attached to it. As per PAPR, node affinity of a cpu is only available once its present / online. For all cpus that are possible but not present, cpu_to_node() would point to node 0. To ensure a cpuless, memoryless dummy node is not online, powerpc need to make sure all possible but not present cpu_to_node are set to a proper node. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com
2020-09-16powerpc/numa: Restrict possible nodes based on platformSrikar Dronamraju1-3/+12
As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time Abstraction Services (RTAS) Node" available at: https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf ... there are 2 device tree properties: "ibm,max-associativity-domains" which defines the maximum number of domains that the firmware i.e PowerVM can support. and: "ibm,current-associativity-domains" which defines the maximum number of domains that the current platform can support. The value of "ibm,max-associativity-domains" is always greater than or equal to "ibm,current-associativity-domains" property. If the latter property is not available, use "ibm,max-associativity-domain" as a fallback. In this yet to be released LoPAPR, "ibm,current-associativity-domains" is mentioned in page 833 / B.5.3 which is covered under under "Appendix B. System Binding" section Currently powerpc uses the "ibm,max-associativity-domains" property while setting the possible number of nodes. This is currently set at 32. However the possible number of nodes for a platform may be significantly less. Hence set the possible number of nodes based on "ibm,current-associativity-domains" property. Nathan Lynch had raised a valid concern that post LPM (Live Partition Migration), a user could DLPAR add processors and memory after LPM with "new" associativity properties: https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u He also pointed out that "ibm,max-associativity-domains" has the same contents on all currently available PowerVM systems, unlike "ibm,current-associativity-domains" and hence may be better able to handle the new NUMA associativity properties. However with the recent commit dbce45628085 ("powerpc/numa: Limit possible nodes to within num_possible_nodes"), all new NUMA associativity properties are capped to initially set nr_node_ids. Hence this commit should be safe with any new DLPAR add post LPM. $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains /proc/device-tree/rtas/ibm,current-associativity-domains 00000005 00000001 00000002 00000002 00000002 00000010 /proc/device-tree/rtas/ibm,max-associativity-domains 00000005 00000001 00000008 00000020 00000020 00000100 $ cat /sys/devices/system/node/possible ##Before patch 0-31 $ cat /sys/devices/system/node/possible ##After patch 0-1 Note the maximum nodes this platform can support is only 2 but the possible nodes is set to 32. This is important because lot of kernel and user space code allocate structures for all possible nodes leading to a lot of memory that is allocated but not used. I ran a simple experiment to create and destroy 100 memory cgroups on boot on a 8 node machine (Power8 Alpine). Before patch: free -k at boot total used free shared buff/cache available Mem: 523498176 4106816 518820608 22272 570752 516606720 Swap: 4194240 0 4194240 free -k after creating 100 memory cgroups total used free shared buff/cache available Mem: 523498176 4628416 518246464 22336 623296 516058688 Swap: 4194240 0 4194240 free -k after destroying 100 memory cgroups total used free shared buff/cache available Mem: 523498176 4697408 518173760 22400 627008 515987904 Swap: 4194240 0 4194240 After patch: free -k at boot total used free shared buff/cache available Mem: 523498176 3969472 518933888 22272 594816 516731776 Swap: 4194240 0 4194240 free -k after creating 100 memory cgroups total used free shared buff/cache available Mem: 523498176 4181888 518676096 22208 640192 516496448 Swap: 4194240 0 4194240 free -k after destroying 100 memory cgroups total used free shared buff/cache available Mem: 523498176 4232320 518619904 22272 645952 516443264 Swap: 4194240 0 4194240 Observations: Fixed kernel takes 137344 kb (4106816-3969472) less to boot. Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [mpe: Reformat change log a bit for readability] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com
2020-08-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+0
Merge misc updates from Andrew Morton: - a few MM hotfixes - kthread, tools, scripts, ntfs and ocfs2 - some of MM Subsystems affected by this patch series: kthread, tools, scripts, ntfs, ocfs2 and mm (hofixes, pagealloc, slab-generic, slab, slub, kcsan, debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore, sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan). * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits) mm: vmscan: consistent update to pgrefill mm/vmscan.c: fix typo khugepaged: khugepaged_test_exit() check mmget_still_valid() khugepaged: retract_page_tables() remember to test exit khugepaged: collapse_pte_mapped_thp() protect the pmd lock khugepaged: collapse_pte_mapped_thp() flush the right range mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible mm: thp: replace HTTP links with HTTPS ones mm/page_alloc: fix memalloc_nocma_{save/restore} APIs mm/page_alloc.c: skip setting nodemask when we are in interrupt mm/page_alloc: fallbacks at most has 3 elements mm/page_alloc: silence a KASAN false positive mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask() mm/page_alloc.c: simplify pageblock bitmap access mm/page_alloc.c: extract the common part in pfn_to_bitidx() mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits mm/shuffle: remove dynamic reconfiguration mm/memory_hotplug: document why shuffle_zone() is relevant mm/page_alloc: remove nr_free_pagecache_pages() mm: remove vm_total_pages ...
2020-08-07mm/sparse: cleanup the code surrounding memory_present()Mike Rapoport1-1/+0
After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent functions that call memory_present() for each region in memblock.memory: sparse_memory_present_with_active_regions() and membocks_present(). Moreover, all architectures have a call to either of these functions preceding the call to sparse_init() and in the most cases they are called one after the other. Mark the regions from memblock.memory as present during sparce_init() by making sparse_init() call memblocks_present(), make memblocks_present() and memory_present() functions static and remove redundant sparse_memory_present_with_active_regions() function. Also remove no longer required HAVE_MEMORY_PRESENT configuration option. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-29powerpc/drmem: Make LMB walk a bit more flexibleHari Bathini1-5/+8
Currently, numa & prom are the only users of drmem LMB walk code. Loading kdump with kexec_file also needs to walk the drmem LMBs to setup the usable memory ranges for kdump kernel. But there are couple of issues in using the code as is. One, walk_drmem_lmb() code is built into the .init section currently, while kexec_file needs it later. Two, there is no scope to pass data to the callback function for processing and/or erroring out on certain conditions. Fix that by, moving drmem LMB walk code out of .init section, adding scope to pass data to the callback function and bailing out when an error is encountered in the callback function. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Tested-by: Pingfan Liu <piliu@redhat.com> Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/159602282727.575379.3979857013827701828.stgit@hbathini
2020-07-26powerpc/numa: Limit possible nodes to within num_possible_nodesSrikar Dronamraju1-3/+4
MAX_NUMNODES is a theoretical maximum number of nodes thats is supported by the kernel. Device tree properties exposes the number of possible nodes on the current platform. The kernel would detected this and would use it for most of its resource allocations. If the platform now increases the nodes to over what was already exposed, then it may lead to inconsistencies. Hence limit it to the already exposed nodes. Suggested-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200724105809.24733-1-srikar@linux.vnet.ibm.com
2020-07-16powerpc/numa: remove arch_update_cpu_topologyNathan Lynch1-10/+0
Since arch_update_cpu_topology() doesn't do anything on powerpc now, remove it and associated dead code. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200612051238.1007764-15-nathanl@linux.ibm.com
2020-07-16powerpc/numa: remove prrn_is_enabled()Nathan Lynch1-5/+0
All users of this prrn_is_enabled() are gone; remove it. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200612051238.1007764-14-nathanl@linux.ibm.com
2020-07-16powerpc/numa: remove start/stop_topology_update()Nathan Lynch1-20/+0
These APIs have become no-ops, so remove them and all call sites. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200612051238.1007764-12-nathanl@linux.ibm.com
2020-07-16powerpc/numa: remove timed_topology_update()Nathan Lynch1-9/+0
timed_topology_update is a no-op now, so remove it and all call sites. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200612051238.1007764-11-nathanl@linux.ibm.com
2020-07-16powerpc/numa: stub out numa_update_cpu_topology()Nathan Lynch1-192/+1
Previous changes have removed the code which sets bits in cpu_associativity_changes_mask and thus it is never modifed at runtime. From this we can reason that numa_update_cpu_topology() always returns 0 without doing anything. Remove the body of numa_update_cpu_topology() and remove all code which becomes unreachable as a result. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200612051238.1007764-10-nathanl@linux.ibm.com
2020-07-16powerpc/numa: remove vphn_enabled and prrn_enabled internal flagsNathan Lynch1-4/+2
These flags are always zero now; remove them and suitably adjust the remaining references to them. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200612051238.1007764-9-nathanl@linux.ibm.com
2020-07-16powerpc/numa: remove unreachable topology workqueue codeNathan Lynch1-14/+0
Since vphn_enabled is always 0, we can remove the call to topology_schedule_update() and remove the code which becomes unreachable as a result. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200612051238.1007764-8-nathanl@linux.ibm.com
2020-07-16powerpc/numa: remove unreachable topology timer codeNathan Lynch1-21/+0
Since vphn_enabled is always 0, we can stub out timed_topology_update() and remove the code which becomes unreachable. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200612051238.1007764-7-nathanl@linux.ibm.com
2020-07-16powerpc/numa: make vphn_enabled, prrn_enabled flags constNathan Lynch1-2/+2
Previous changes have made it so these flags are never changed; enforce this by making them const. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200612051238.1007764-6-nathanl@linux.ibm.com
2020-07-16powerpc/numa: remove unreachable topology update codeNathan Lynch1-147/+2
Since the topology_updates_enabled flag is now always false, remove it and the code which has become unreachable. This is the minimum change that prevents 'defined but unused' warnings emitted by the compiler after stubbing out the start/stop_topology_updates() functions. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200612051238.1007764-5-nathanl@linux.ibm.com
2020-07-16powerpc/numa: remove ability to enable topology updatesNathan Lynch1-70/+1
Remove the /proc/powerpc/topology_updates interface and the topology_updates=on/off command line argument. The internal topology_updates_enabled flag remains for now, but always false. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200612051238.1007764-4-nathanl@linux.ibm.com
2020-03-04powerpc/numa: Remove late request for home node associativitySrikar Dronamraju1-9/+0
With commit ("powerpc/numa: Early request for home node associativity"), commit 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot") which was requesting home node associativity becomes redundant. Hence remove the late request for home node associativity. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200129135301.24739-6-srikar@linux.vnet.ibm.com
2020-03-04powerpc/numa: Early request for home node associativitySrikar Dronamraju1-1/+40
Currently the kernel detects if its running on a shared lpar platform and requests home node associativity before the scheduler sched_domains are setup. However between the time NUMA setup is initialized and the request for home node associativity, workqueue initializes its per node cpumask. The per node workqueue possible cpumask may turn invalid after home node associativity resulting in weird situations like workqueue possible cpumask being a subset of workqueue online cpumask. This can be fixed by requesting home node associativity earlier just before NUMA setup. However at the NUMA setup time, kernel may not be in a position to detect if its running on a shared lpar platform. So request for home node associativity and if the request fails, fallback on the device tree property. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200129135301.24739-5-srikar@linux.vnet.ibm.com
2020-03-04powerpc/numa: Use cpu node map of first sibling threadSrikar Dronamraju1-2/+20
All the sibling threads of a core have to be part of the same node. To ensure that all the sibling threads map to the same node, always lookup/update the cpu-to-node map of the first thread in the core. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200129135301.24739-4-srikar@linux.vnet.ibm.com
2020-03-04powerpc/numa: Handle extra hcall_vphn error casesSrikar Dronamraju1-9/+16
Currently code handles H_FUNCTION, H_SUCCESS, H_HARDWARE return codes. However hcall_vphn can return other return codes. Now it also handles H_PARAMETER return code. Also the rest return codes are handled under the default case. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200129135301.24739-3-srikar@linux.vnet.ibm.com
2020-02-04proc: convert everything to "struct proc_ops"Alexey Dobriyan1-6/+6
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in seq_file.h. Conversion rule is: llseek => proc_lseek unlocked_ioctl => proc_ioctl xxx => proc_xxx delete ".owner = THIS_MODULE" line [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c] [sfr@canb.auug.org.au: fix kernel/sched/psi.c] Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-13Merge tag 'powerpc-5.3-1' of ↵Linus Torvalds1-26/+35
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Notable changes: - Removal of the NPU DMA code, used by the out-of-tree Nvidia driver, as well as some other functions only used by drivers that haven't (yet?) made it upstream. - A fix for a bug in our handling of hardware watchpoints (eg. perf record -e mem: ...) which could lead to register corruption and kernel crashes. - Enable HAVE_ARCH_HUGE_VMAP, which allows us to use large pages for vmalloc when using the Radix MMU. - A large but incremental rewrite of our exception handling code to use gas macros rather than multiple levels of nested CPP macros. And the usual small fixes, cleanups and improvements. Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Andreas Schwab, Aneesh Kumar K.V, Anju T Sudhakar, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Cédric Le Goater, Christian Lamparter, Christophe Leroy, Christophe Lombard, Christoph Hellwig, Daniel Axtens, Denis Efremov, Enrico Weigelt, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geliang Tang, Gen Zhang, Greg Kroah-Hartman, Greg Kurz, Gustavo Romero, Krzysztof Kozlowski, Madhavan Srinivasan, Masahiro Yamada, Mathieu Malaterre, Michael Neuling, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Nishad Kamdar, Oliver O'Halloran, Qian Cai, Ravi Bangoria, Sachin Sant, Sam Bobroff, Satheesh Rajendran, Segher Boessenkool, Shaokun Zhang, Shawn Anastasio, Stewart Smith, Suraj Jitindar Singh, Thiago Jung Bauermann, YueHaibing" * tag 'powerpc-5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (163 commits) powerpc/powernv/idle: Fix restore of SPRN_LDBAR for POWER9 stop state. powerpc/eeh: Handle hugepages in ioremap space ocxl: Update for AFU descriptor template version 1.1 powerpc/boot: pass CONFIG options in a simpler and more robust way powerpc/boot: add {get, put}_unaligned_be32 to xz_config.h powerpc/irq: Don't WARN continuously in arch_local_irq_restore() powerpc/module64: Use symbolic instructions names. powerpc/module32: Use symbolic instructions names. powerpc: Move PPC_HA() PPC_HI() and PPC_LO() to ppc-opcode.h powerpc/module64: Fix comment in R_PPC64_ENTRY handling powerpc/boot: Add lzo support for uImage powerpc/boot: Add lzma support for uImage powerpc/boot: don't force gzipped uImage powerpc/8xx: Add microcode patch to move SMC parameter RAM. powerpc/8xx: Use IO accessors in microcode programming. powerpc/8xx: replace #ifdefs by IS_ENABLED() in microcode.c powerpc/8xx: refactor programming of microcode CPM params. powerpc/8xx: refactor printing of microcode patch name. powerpc/8xx: Refactor microcode write powerpc/8xx: refactor writing of CPM microcode arrays ...
2019-07-05powerpc/mm: Consolidate numa_enable check and min_common_depth checkAneesh Kumar K.V1-5/+11
If we fail to parse min_common_depth from device tree we boot with numa disabled. Reflect the same by updating numa_enabled variable to false. Also, switch all min_common_depth failure check to if (!numa_enabled) check. This helps us to avoid checking for both in different code paths. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-05powerpc/mm: Fix node look up with numa=off bootAneesh Kumar K.V1-3/+3
If we boot with numa=off, we need to make sure we return NUMA_NO_NODE when looking up associativity details of resources. Without this, we hit crash like below BUG: Unable to handle kernel data access at 0x40000000008 Faulting instruction address: 0xc000000008f31704 cpu 0x1b: Vector: 380 (Data SLB Access) at [c00000000b9bb320] pc: c000000008f31704: _raw_spin_lock+0x14/0x100 lr: c0000000083f41fc: ____cache_alloc_node+0x5c/0x290 sp: c00000000b9bb5b0 msr: 800000010280b033 dar: 40000000008 current = 0xc00000000b9a2700 paca = 0xc00000000a740c00 irqmask: 0x03 irq_happened: 0x01 pid = 1, comm = swapper/27 Linux version 5.2.0-rc4-00925-g74e188c620b1 (root@linux-d8ip) (gcc version 7.4.1 20190424 [gcc-7-branch revision 270538] (SUSE Linux)) #34 SMP Sat Jun 29 00:41:02 EDT 2019 enter ? for help [link register ] c0000000083f41fc ____cache_alloc_node+0x5c/0x290 [c00000000b9bb5b0] 0000000000000dc0 (unreliable) [c00000000b9bb5f0] c0000000083f48c8 kmem_cache_alloc_node_trace+0x138/0x360 [c00000000b9bb670] c000000008aa789c devres_alloc_node+0x4c/0xa0 [c00000000b9bb6a0] c000000008337218 devm_memremap+0x58/0x130 [c00000000b9bb6f0] c000000008aed00c devm_nsio_enable+0xdc/0x170 [c00000000b9bb780] c000000008af3b6c nd_pmem_probe+0x4c/0x180 [c00000000b9bb7b0] c000000008ad84cc nvdimm_bus_probe+0xac/0x260 [c00000000b9bb840] c000000008aa0628 really_probe+0x148/0x500 [c00000000b9bb8d0] c000000008aa0d7c driver_probe_device+0x19c/0x1d0 [c00000000b9bb950] c000000008aa11bc device_driver_attach+0xcc/0x100 [c00000000b9bb990] c000000008aa12ec __driver_attach+0xfc/0x1e0 [c00000000b9bba10] c000000008a9d0a4 bus_for_each_dev+0xb4/0x130 [c00000000b9bba70] c000000008a9fc04 driver_attach+0x34/0x50 [c00000000b9bba90] c000000008a9f118 bus_add_driver+0x1d8/0x300 [c00000000b9bbb20] c000000008aa2358 driver_register+0x98/0x1a0 [c00000000b9bbb90] c000000008ad7e6c __nd_driver_register+0x5c/0x100 [c00000000b9bbbf0] c0000000093efbac nd_pmem_driver_init+0x34/0x48 [c00000000b9bbc10] c0000000080106c0 do_one_initcall+0x60/0x2d0 [c00000000b9bbce0] c00000000938463c kernel_init_freeable+0x384/0x48c [c00000000b9bbdb0] c000000008010a5c kernel_init+0x2c/0x160 [c00000000b9bbe20] c00000000800ba54 ret_from_kernel_thread+0x5c/0x68 Reported-and-debugged-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-05powerpc/mm/drconf: Use NUMA_NO_NODE on failures instead of node 0Aneesh Kumar K.V1-4/+6
If we fail to parse the associativity array we should default to NUMA_NO_NODE instead of NODE 0. Rest of the code fallback to the right default if we find the numa node value NUMA_NO_NODE. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-04powerpc/pseries: Provide vcpu dispatch statisticsNaveen N. Rao1-0/+16
For Shared Processor LPARs, the POWER Hypervisor maintains a relatively static mapping of the LPAR processors (vcpus) to physical processor chips (representing the "home" node) and tries to always dispatch vcpus on their associated physical processor chip. However, under certain scenarios, vcpus may be dispatched on a different processor chip (away from its home node). The actual physical processor number on which a certain vcpu is dispatched is available to the guest in the 'processor_id' field of each DTL entry. The guest can discover the home node of each vcpu through the H_HOME_NODE_ASSOCIATIVITY(flags=1) hcall. The guest can also discover the associativity of physical processors, as represented in the DTL entry, through the H_HOME_NODE_ASSOCIATIVITY(flags=2) hcall. These can then be compared to determine if the vcpu was dispatched on its home node or not. If the vcpu was not dispatched on the home node, it is possible to determine if the vcpu was dispatched in a different chip, socket or drawer. Introduce a procfs file /proc/powerpc/vcpudispatch_stats that can be used to obtain these statistics. Writing '1' to this file enables collecting the statistics, while writing '0' disables the statistics. The statistics themselves are available by reading the procfs file. By default, the DTLB log for each vcpu is processed 50 times a second so as not to miss any entries. This processing frequency can be changed through /proc/powerpc/vcpudispatch_stats_freq. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-04powerpc/pseries: Move mm/book3s64/vphn.c under platforms/pseries/Naveen N. Rao1-14/+0
hcall_vphn() is specific to pseries and will be used in a subsequent patch. So, move it to a more appropriate place under arch/powerpc/platforms/pseries. Also merge vphn.h into lppaca.h and update vphn selftest to use the new files. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-07-04powerpc/pseries: Generalize hcall_vphn()Naveen N. Rao1-14/+13
H_HOME_NODE_ASSOCIATIVITY hcall can take two different flags and return different associativity information in each case. Generalize the existing hcall_vphn() function to take flags as an argument and to return the result. Update the only existing user to pass the proper arguments. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner1-5/+1
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-03powerpc/mm: Move book3s64 specifics in subdirectory mm/book3s64Christophe Leroy1-1/+1
Many files in arch/powerpc/mm are only for book3S64. This patch creates a subdirectory for them. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> [mpe: Update the selftest sym links, shorten new filenames, cleanup some whitespace and formatting in the new files.] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20powerpc/numa: document topology_updates_enabled, disable by defaultNathan Lynch1-4/+10
Changing the NUMA associations for CPUs and memory at runtime is basically unsupported by the core mm, scheduler etc. We see all manner of crashes, warnings and instability when the pseries code tries to do this. Disable this behavior by default, and document the switch a bit. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20powerpc/numa: improve control of topology updatesNathan Lynch1-6/+12
When booted with "topology_updates=no", or when "off" is written to /proc/powerpc/topology_updates, NUMA reassignments are inhibited for PRRN and VPHN events. However, migration and suspend unconditionally re-enable reassignments via start_topology_update(). This is incoherent. Check the topology_updates_enabled flag in start/stop_topology_update() so that callers of those APIs need not be aware of whether reassignments are enabled. This allows the administrative decision on reassignments to remain in force across migrations and suspensions. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20powerpc: Remove duplicate headersJagadeesh Pagadala1-1/+0
Remove duplicate headers inclusions. Signed-off-by: Jagadeesh Pagadala <jagdsh.linux@gmail.com> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-12memblock: memblock_phys_alloc_try_nid(): don't panicMike Rapoport1-0/+4
The memblock_phys_alloc_try_nid() function tries to allocate memory from the requested node and then falls back to allocation from any node in the system. The memblock_alloc_base() fallback used by this function panics if the allocation fails. Replace the memblock_alloc_base() fallback with the direct call to memblock_alloc_range_nid() and update the memblock_phys_alloc_try_nid() callers to check the returned value and panic in case of error. Link: http://lkml.kernel.org/r/1548057848-15136-7-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07Merge tag 'powerpc-5.1-1' of ↵Linus Torvalds1-8/+1
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Notable changes: - Enable THREAD_INFO_IN_TASK to move thread_info off the stack. - A big series from Christoph reworking our DMA code to use more of the generic infrastructure, as he said: "This series switches the powerpc port to use the generic swiotlb and noncoherent dma ops, and to use more generic code for the coherent direct mapping, as well as removing a lot of dead code." - Increase our vmalloc space to 512T with the Hash MMU on modern CPUs, allowing us to support machines with larger amounts of total RAM or distance between nodes. - Two series from Christophe, one to optimise TLB miss handlers on 6xx, and another to optimise the way STRICT_KERNEL_RWX is implemented on some 32-bit CPUs. - Support for KCOV coverage instrumentation which means we can run syzkaller and discover even more bugs in our code. And as always many clean-ups, reworks and minor fixes etc. Thanks to: Alan Modra, Alexey Kardashevskiy, Alistair Popple, Andrea Arcangeli, Andrew Donnellan, Aneesh Kumar K.V, Aravinda Prasad, Balbir Singh, Brajeswar Ghosh, Breno Leitao, Christian Lamparter, Christian Zigotzky, Christophe Leroy, Christoph Hellwig, Corentin Labbe, Daniel Axtens, David Gibson, Diana Craciun, Firoz Khan, Gustavo A. R. Silva, Igor Stoppa, Joe Lawrence, Joel Stanley, Jonathan Neuschäfer, Jordan Niethe, Laurent Dufour, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Masahiro Yamada, Mathieu Malaterre, Matteo Croce, Meelis Roos, Michael W. Bringmann, Nathan Chancellor, Nathan Fontenot, Nicholas Piggin, Nick Desaulniers, Nicolai Stange, Oliver O'Halloran, Paul Mackerras, Peter Xu, PrasannaKumar Muralidharan, Qian Cai, Rashmica Gupta, Reza Arbab, Robert P. J. Day, Russell Currey, Sabyasachi Gupta, Sam Bobroff, Sandipan Das, Sergey Senozhatsky, Souptick Joarder, Stewart Smith, Tyrel Datwyler, Vaibhav Jain, YueHaibing" * tag 'powerpc-5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (200 commits) powerpc/32: Clear on-stack exception marker upon exception return powerpc: Remove export of save_stack_trace_tsk_reliable() powerpc/mm: fix "section_base" set but not used powerpc/mm: Fix "sz" set but not used warning powerpc/mm: Check secondary hash page table powerpc: remove nargs from __SYSCALL powerpc/64s: Fix unrelocated interrupt trampoline address test powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables powerpc/fsl: Fix the flush of branch predictor. powerpc/powernv: Make opal log only readable by root powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc powerpc/powernv: move OPAL call wrapper tracing and interrupt handling to C powerpc/64s: Fix data interrupts vs d-side MCE reentrancy powerpc/64s: Prepare to handle data interrupts vs d-side MCE reentrancy powerpc/64s: system reset interrupt preserve HSRRs powerpc/64s: Fix HV NMI vs HV interrupt recoverability test powerpc/mm/hash: Handle mmap_min_addr correctly in get_unmapped_area topdown search powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback selftests/powerpc: Remove duplicate header powerpc sstep: Add support for modsd, modud instructions ...
2019-03-05numa: make "nr_node_ids" unsigned intAlexey Dobriyan1-1/+1
Number of NUMA nodes can't be negative. This saves a few bytes on x86_64: add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238) Function old new delta hv_synic_alloc.cold 88 110 +22 prealloc_shrinker 260 262 +2 bootstrap 249 251 +2 sched_init_numa 1566 1567 +1 show_slab_objects 778 777 -1 s_show 1201 1200 -1 kmem_cache_init 346 345 -1 __alloc_workqueue_key 1146 1145 -1 mem_cgroup_css_alloc 1614 1612 -2 __do_sys_swapon 4702 4699 -3 __list_lru_init 655 651 -4 nic_probe 2379 2374 -5 store_user_store 118 111 -7 red_zone_store 106 99 -7 poison_store 106 99 -7 wq_numa_init 348 338 -10 __kmem_cache_empty 75 65 -10 task_numa_free 186 173 -13 merge_across_nodes_store 351 336 -15 irq_create_affinity_masks 1261 1246 -15 do_numa_crng_init 343 321 -22 task_numa_fault 4760 4737 -23 swapfile_init 179 156 -23 hv_synic_alloc 536 492 -44 apply_wqattrs_prepare 746 695 -51 Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm: replace all open encodings for NUMA_NO_NODEAnshuman Khandual1-7/+7
Patch series "Replace all open encodings for NUMA_NO_NODE", v3. All these places for replacement were found by running the following grep patterns on the entire kernel code. Please let me know if this might have missed some instances. This might also have replaced some false positives. I will appreciate suggestions, inputs and review. 1. git grep "nid == -1" 2. git grep "node == -1" 3. git grep "nid = -1" 4. git grep "node = -1" This patch (of 2): At present there are multiple places where invalid node number is encoded as -1. Even though implicitly understood it is always better to have macros in there. Replace these open encodings for an invalid node number with the global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like 'invalid node' from various places redirecting them to a common definition. Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe] Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx] Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband] Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-30powerpc/pseries: Perform full re-add of CPU for topology update post-migrationNathan Fontenot1-8/+1
On pseries systems, performing a partition migration can result in altering the nodes a CPU is assigned to on the destination system. For exampl, pre-migration on the source system CPUs are in node 1 and 3, post-migration on the destination system CPUs are in nodes 2 and 3. Handling the node change for a CPU can cause corruption in the slab cache if we hit a timing where a CPUs node is changed while cache_reap() is invoked. The corruption occurs because the slab cache code appears to rely on the CPU and slab cache pages being on the same node. The current dynamic updating of a CPUs node done in arch/powerpc/mm/numa.c does not prevent us from hitting this scenario. Changing the device tree property update notification handler that recognizes an affinity change for a CPU to do a full DLPAR remove and add of the CPU instead of dynamically changing its node resolves this issue. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael W. Bringmann <mwb@linux.vnet.ibm.com> Tested-by: Michael W. Bringmann <mwb@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-11-26powerpc: Use device_type helpers to access the node typeRob Herring1-1/+1
Remove directly accessing device_node.type pointer and use the accessors instead. This will eventually allow removing the type pointer. Replace the open coded iterating over child nodes with for_each_child_of_node() while we're here. Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-11-14powerpc/numa: Suppress "VPHN is not supported" messagesSatheesh Rajendran1-1/+1
When VPHN function is not supported and during cpu hotplug event, kernel prints message 'VPHN function not supported. Disabling polling...'. Currently it prints on every hotplug event, it floods dmesg when a KVM guest tries to hotplug huge number of vcpus, let's just print once and suppress further kernel prints. Signed-off-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-31mm: remove include/linux/bootmem.hMike Rapoport1-2/+1
Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31memblock: rename memblock_alloc{_nid,_try_nid} to memblock_phys_alloc*Mike Rapoport1-1/+1
Make it explicit that the caller gets a physical address rather than a virtual one. This will also allow using meblock_alloc prefix for memblock allocations returning virtual address, which is done in the following patches. The conversion is done using the following semantic patch: @@ expression e1, e2, e3; @@ ( - memblock_alloc(e1, e2) + memblock_phys_alloc(e1, e2) | - memblock_alloc_nid(e1, e2, e3) + memblock_phys_alloc_nid(e1, e2, e3) | - memblock_alloc_try_nid(e1, e2, e3) + memblock_phys_alloc_try_nid(e1, e2, e3) ) Link: http://lkml.kernel.org/r/1536927045-23536-7-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-13powerpc/pseries/mobility: Extend start/stop topology update scopeMichael Bringmann1-0/+6
The powerpc mobility code may receive RTAS requests to perform PRRN (Platform Resource Reassignment Notification) topology changes at any time, including during LPAR migration operations. In some configurations where the affinity of CPUs or memory is being changed on that platform, the PRRN requests may apply or refer to outdated information prior to the complete update of the device-tree. This patch changes the duration for which topology updates are suppressed during LPAR migrations from just the rtas_ibm_suspend_me() / 'ibm,suspend-me' call(s) to cover the entire migration_store() operation to allow all changes to the device-tree to be applied prior to accepting and applying any PRRN requests. For tracking purposes, pr_info notices are added to the functions start_topology_update() and stop_topology_update() of 'numa.c'. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-05powerpc/numa: Skip onlining a offline node in kdump pathSrikar Dronamraju1-2/+3
With commit 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot"), kdump kernel on shared LPAR may crash. The necessary conditions are - Shared LPAR with at least 2 nodes having memory and CPUs. - Memory requirement for kdump kernel must be met by the first N-1 nodes where there are at least N nodes with memory and CPUs. Example numactl of such a machine. $ numactl -H available: 5 nodes (0,2,5-7) node 0 cpus: node 0 size: 0 MB node 0 free: 0 MB node 2 cpus: node 2 size: 255 MB node 2 free: 189 MB node 5 cpus: 24 25 26 27 28 29 30 31 node 5 size: 4095 MB node 5 free: 4024 MB node 6 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 node 6 size: 6353 MB node 6 free: 5998 MB node 7 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 node 7 size: 7640 MB node 7 free: 7164 MB node distances: node 0 2 5 6 7 0: 10 40 40 40 40 2: 40 10 40 40 40 5: 40 40 10 40 40 6: 40 40 40 10 20 7: 40 40 40 20 10 Steps to reproduce. 1. Load / start kdump service. 2. Trigger a kdump (for example : echo c > /proc/sysrq-trigger) When booting a kdump kernel with 2048M: kexec: Starting switchover sequence. I'm in purgatory Using 1TB segments hash-mmu: Initializing hash mmu with SLB Linux version 4.19.0-rc5-master+ (srikar@linux-xxu6) (gcc version 4.8.5 (SUSE Linux)) #1 SMP Thu Sep 27 19:45:00 IST 2018 Found initrd at 0xc000000009e70000:0xc00000000ae554b4 Using pSeries machine description ----------------------------------------------------- ppc64_pft_size = 0x1e phys_mem_size = 0x88000000 dcache_bsize = 0x80 icache_bsize = 0x80 cpu_features = 0x000000ff8f5d91a7 possible = 0x0000fbffcf5fb1a7 always = 0x0000006f8b5c91a1 cpu_user_features = 0xdc0065c2 0xef000000 mmu_features = 0x7c006001 firmware_features = 0x00000007c45bfc57 htab_hash_mask = 0x7fffff physical_start = 0x8000000 ----------------------------------------------------- numa: NODE_DATA [mem 0x87d5e300-0x87d67fff] numa: NODE_DATA(0) on node 6 numa: NODE_DATA [mem 0x87d54600-0x87d5e2ff] Top of RAM: 0x88000000, Total RAM: 0x88000000 Memory hole size: 0MB Zone ranges: DMA [mem 0x0000000000000000-0x0000000087ffffff] DMA32 empty Normal empty Movable zone start for each node Early memory node ranges node 6: [mem 0x0000000000000000-0x0000000087ffffff] Could not find start_pfn for node 0 Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000] On node 0 totalpages: 0 Initmem setup node 6 [mem 0x0000000000000000-0x0000000087ffffff] On node 6 totalpages: 34816 Unable to handle kernel paging request for data at address 0x00000060 Faulting instruction address: 0xc000000008703a54 Oops: Kernel access of bad area, sig: 11 [#1] LE SMP NR_CPUS=2048 NUMA pSeries Modules linked in: CPU: 11 PID: 1 Comm: swapper/11 Not tainted 4.19.0-rc5-master+ #1 NIP: c000000008703a54 LR: c000000008703a38 CTR: 0000000000000000 REGS: c00000000b673440 TRAP: 0380 Not tainted (4.19.0-rc5-master+) MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 24022022 XER: 20000002 CFAR: c0000000086fc238 IRQMASK: 0 GPR00: c000000008703a38 c00000000b6736c0 c000000009281900 0000000000000000 GPR04: 0000000000000000 0000000000000000 fffffffffffff001 c00000000b660080 GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000220 GPR12: 0000000000002200 c000000009e51400 0000000000000000 0000000000000008 GPR16: 0000000000000000 c000000008c152e8 c000000008c152a8 0000000000000000 GPR20: c000000009422fd8 c000000009412fd8 c000000009426040 0000000000000008 GPR24: 0000000000000000 0000000000000000 c000000009168bc8 c000000009168c78 GPR28: c00000000b126410 0000000000000000 c00000000916a0b8 c00000000b126400 NIP [c000000008703a54] bus_add_device+0x84/0x1e0 LR [c000000008703a38] bus_add_device+0x68/0x1e0 Call Trace: [c00000000b6736c0] [c000000008703a38] bus_add_device+0x68/0x1e0 (unreliable) [c00000000b673740] [c000000008700194] device_add+0x454/0x7c0 [c00000000b673800] [c00000000872e660] __register_one_node+0xb0/0x240 [c00000000b673860] [c00000000839a6bc] __try_online_node+0x12c/0x180 [c00000000b673900] [c00000000839b978] try_online_node+0x58/0x90 [c00000000b673930] [c0000000080846d8] find_and_online_cpu_nid+0x158/0x190 [c00000000b673a10] [c0000000080848a0] numa_update_cpu_topology+0x190/0x580 [c00000000b673c00] [c000000008d3f2e4] smp_cpus_done+0x94/0x108 [c00000000b673c70] [c000000008d5c00c] smp_init+0x174/0x19c [c00000000b673d00] [c000000008d346b8] kernel_init_freeable+0x1e0/0x450 [c00000000b673dc0] [c0000000080102e8] kernel_init+0x28/0x160 [c00000000b673e30] [c00000000800b65c] ret_from_kernel_thread+0x5c/0x80 Instruction dump: 60000000 60000000 e89e0020 7fe3fb78 4bff87d5 60000000 7c7d1b79 4082008c e8bf0050 e93e0098 3b9f0010 2fa50000 <e8690060> 38630018 419e0114 7f84e378 ---[ end trace 593577668c2daa65 ]--- However a regular kernel with 4096M (2048 gets reserved for crash kernel) boots properly. Unlike regular kernels, which mark all available nodes as online, kdump kernel only marks just enough nodes as online and marks the rest as offline at boot. However kdump kernel boots with all available CPUs. With Commit 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot"), all CPUs are onlined on their respective nodes at boot time. try_online_node() tries to online the offline nodes but fails as all needed subsystems are not yet initialized. As part of fix, detect and skip early onlining of a offline node. Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot") Reported-by: Pavithra Prakash <pavrampu@in.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-25powerpc/numa: Use associativity if VPHN hcall is successfulSrikar Dronamraju1-1/+3
Currently associativity is used to lookup node-id even if the preceding VPHN hcall failed. However this can cause CPU to be made part of the wrong node, (most likely to be node 0). This is because VPHN is not enabled on KVM guests. With 2ea6263 ("powerpc/topology: Get topology for shared processors at boot"), associativity is used to set to the wrong node. Hence KVM guest topology is broken. For example : A 4 node KVM guest before would have reported. [root@localhost ~]# numactl -H available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 node 0 size: 1746 MB node 0 free: 1604 MB node 1 cpus: 4 5 6 7 node 1 size: 2044 MB node 1 free: 1765 MB node 2 cpus: 8 9 10 11 node 2 size: 2044 MB node 2 free: 1837 MB node 3 cpus: 12 13 14 15 node 3 size: 2044 MB node 3 free: 1903 MB node distances: node 0 1 2 3 0: 10 40 40 40 1: 40 10 40 40 2: 40 40 10 40 3: 40 40 40 10 Would now report: [root@localhost ~]# numactl -H available: 4 nodes (0-3) node 0 cpus: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 node 0 size: 1746 MB node 0 free: 1244 MB node 1 cpus: node 1 size: 2044 MB node 1 free: 2032 MB node 2 cpus: 1 node 2 size: 2044 MB node 2 free: 2028 MB node 3 cpus: node 3 size: 2044 MB node 3 free: 2032 MB node distances: node 0 1 2 3 0: 10 40 40 40 1: 40 10 40 40 2: 40 40 10 40 3: 40 40 40 10 Fix this by skipping associativity lookup if the VPHN hcall failed. Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot") Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-24powerpc/pseries: Fix unitialized timer reset on migrationMichael Bringmann1-1/+2
After migration of a powerpc LPAR, the kernel executes code to update the system state to reflect new platform characteristics. Such changes include modifications to device tree properties provided to the system by PHYP. Property notifications received by the post_mobility_fixup() code are passed along to the kernel in general through a call to of_update_property() which in turn passes such events back to all modules through entries like the '.notifier_call' function within the NUMA module. When the NUMA module updates its state, it resets its event timer. If this occurs after a previous call to stop_topology_update() or on a system without VPHN enabled, the code runs into an unitialized timer structure and crashes. This patch adds a safety check along this path toward the problem code. An example crash log is as follows. ibmvscsi 30000081: Re-enabling adapter! ------------[ cut here ]------------ kernel BUG at kernel/time/timer.c:958! Oops: Exception in kernel mode, sig: 5 [#1] LE SMP NR_CPUS=2048 NUMA pSeries Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag lockd unix_diag af_packet_diag netlink_diag grace fscache sunrpc xts vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod CPU: 11 PID: 3067 Comm: drmgr Not tainted 4.17.0+ #179 ... NIP mod_timer+0x4c/0x400 LR reset_topology_timer+0x40/0x60 Call Trace: 0xc0000003f9407830 (unreliable) reset_topology_timer+0x40/0x60 dt_update_callback+0x100/0x120 notifier_call_chain+0x90/0x100 __blocking_notifier_call_chain+0x60/0x90 of_property_notify+0x90/0xd0 of_update_property+0x104/0x150 update_dt_property+0xdc/0x1f0 pseries_devicetree_update+0x2d0/0x510 post_mobility_fixup+0x7c/0xf0 migration_store+0xa4/0xc0 kobj_attr_store+0x30/0x60 sysfs_kf_write+0x64/0xa0 kernfs_fop_write+0x16c/0x240 __vfs_write+0x40/0x200 vfs_write+0xc8/0x240 ksys_write+0x5c/0x100 system_call+0x58/0x6c Fixes: 5d88aa85c00b ("powerpc/pseries: Update CPU maps when device tree is updated") Cc: stable@vger.kernel.org # v3.10+ Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-08-21powerpc/topology: Get topology for shared processors at bootSrikar Dronamraju1-10/+10
On a shared LPAR, Phyp will not update the CPU associativity at boot time. Just after the boot system does recognize itself as a shared LPAR and trigger a request for correct CPU associativity. But by then the scheduler would have already created/destroyed its sched domains. This causes - Broken load balance across Nodes causing islands of cores. - Performance degradation esp if the system is lightly loaded - dmesg to wrongly report all CPUs to be in Node 0. - Messages in dmesg saying borken topology. - With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain"), can cause rcu stalls at boot up. The sched_domains_numa_masks table which is used to generate cpumasks is only created at boot time just before creating sched domains and never updated. Hence, its better to get the topology correct before the sched domains are created. For example on 64 core Power 8 shared LPAR, dmesg reports Brought up 512 CPUs Node 0 CPUs: 0-511 Node 1 CPUs: Node 2 CPUs: Node 3 CPUs: Node 4 CPUs: Node 5 CPUs: Node 6 CPUs: Node 7 CPUs: Node 8 CPUs: Node 9 CPUs: Node 10 CPUs: Node 11 CPUs: ... BUG: arch topology borken the DIE domain not a subset of the NUMA domain BUG: arch topology borken the DIE domain not a subset of the NUMA domain numactl/lscpu output will still be correct with cores spreading across all nodes: Socket(s): 64 NUMA node(s): 12 Model: 2.0 (pvr 004d 0200) Model name: POWER8 (architected), altivec supported Hypervisor vendor: pHyp Virtualization type: para L1d cache: 64K L1i cache: 32K NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471 NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479 NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487 NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495 NUMA node4 CPU(s): 208-215,304-311,400-407,496-503 NUMA node5 CPU(s): 168-175,264-271,360-367,456-463 NUMA node6 CPU(s): 128-135,224-231,320-327,416-423 NUMA node7 CPU(s): 136-143,232-239,328-335,424-431 NUMA node8 CPU(s): 216-223,312-319,408-415,504-511 NUMA node9 CPU(s): 144-151,240-247,336-343,432-439 NUMA node10 CPU(s): 152-159,248-255,344-351,440-447 NUMA node11 CPU(s): 160-167,256-263,352-359,448-455 Currently on this LPAR, the scheduler detects 2 levels of Numa and created numa sched domains for all CPUs, but it finds a single DIE domain consisting of all CPUs. Hence it deletes all numa sched domains. To address this, detect the shared processor and update topology soon after CPUs are setup so that correct topology is updated just before scheduler creates sched domain. With the fix, dmesg reports: numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 368-375 464-471 numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 376-383 472-479 numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 384-391 480-487 numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 392-399 488-495 numa: Node 4 CPUs: 208-215 304-311 400-407 496-503 numa: Node 5 CPUs: 168-175 264-271 360-367 456-463 numa: Node 6 CPUs: 128-135 224-231 320-327 416-423 numa: Node 7 CPUs: 136-143 232-239 328-335 424-431 numa: Node 8 CPUs: 216-223 312-319 408-415 504-511 numa: Node 9 CPUs: 144-151 240-247 336-343 432-439 numa: Node 10 CPUs: 152-159 248-255 344-351 440-447 numa: Node 11 CPUs: 160-167 256-263 352-359 448-455 and lscpu also reports: Socket(s): 64 NUMA node(s): 12 Model: 2.0 (pvr 004d 0200) Model name: POWER8 (architected), altivec supported Hypervisor vendor: pHyp Virtualization type: para L1d cache: 64K L1i cache: 32K NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471 NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479 NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487 NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495 NUMA node4 CPU(s): 208-215,304-311,400-407,496-503 NUMA node5 CPU(s): 168-175,264-271,360-367,456-463 NUMA node6 CPU(s): 128-135,224-231,320-327,416-423 NUMA node7 CPU(s): 136-143,232-239,328-335,424-431 NUMA node8 CPU(s): 216-223,312-319,408-415,504-511 NUMA node9 CPU(s): 144-151,240-247,336-343,432-439 NUMA node10 CPU(s): 152-159,248-255,344-351,440-447 NUMA node11 CPU(s): 160-167,256-263,352-359,448-455 Reported-by: Manjunatha H R <manjuhr1@in.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [mpe: Trim / format change log] Tested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-06-12treewide: kzalloc() -> kcalloc()Kees Cook1-1/+1
The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-03-31Merge branch 'topic/paca' into nextMichael Ellerman1-15/+21
Bring in yet another series that touches KVM code, and might need to be merged into the kvm-ppc branch to resolve conflicts. This required some changes in pnv_power9_force_smt4_catch/release() due to the paca array becomming an array of pointers.
2018-03-30powerpc/mm/numa: move numa topology discovery earlierNicholas Piggin1-13/+19
Split sparsemem initialisation from basic numa topology discovery. Move the parsing earlier in boot, before pacas are allocated. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-30powerpc/64s: Allocate LPPACAs individuallyNicholas Piggin1-2/+2
We no longer allocate lppacas in an array, so this patch removes the 1kB static alignment for the structure, and enforces the PAPR alignment requirements at allocation time. We can not reduce the 1kB allocation size however, due to existing KVM hypervisors. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-02-08powerpc/numa: Invalidate numa_cpu_lookup_table on cpu removeNathan Fontenot1-5/+0
When DLPAR removing a CPU, the unmapping of the cpu from a node in unmap_cpu_from_node() should also invalidate the CPUs entry in the numa_cpu_lookup_table. There is not a guarantee that on a subsequent DLPAR add of the CPU the associativity will be the same and thus could be in a different node. Invalidating the entry in the numa_cpu_lookup_table causes the associativity to be read from the device tree at the time of the add. The current behavior of not invalidating the CPUs entry in the numa_cpu_lookup_table can result in scenarios where the the topology layout of CPUs in the partition does not match the device tree or the topology reported by the HMC. This bug looks like it was introduced in 2004 in the commit titled "ppc64: cpu hotplug notifier for numa", which is 6b15e4e87e32 in the linux-fullhist tree. Hence tag it for all stable releases. Cc: stable@vger.kernel.org Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-27powerpc/pseries: Fix cpu hotplug crash with memoryless nodesMichael Bringmann1-1/+3
On powerpc systems with shared configurations of CPUs and memory and memoryless nodes at boot, an event ordering problem was observed on a SLES12 build platforms with the hot-add of CPUs to the memoryless nodes. * The most common error occurred when the memory SLAB driver attempted to reference the memoryless node to which a CPU was being added before the kernel had finished initializing all of the data structures for the CPU and exited 'device_online' under DLPAR/hot-add. Normally the memoryless node would be initialized through the call path device_online ... arch_update_cpu_topology ... find_cpu_nid ... try_online_node. This patch ensures that the powerpc node will be initialized as early as possible, even if it was memoryless and CPU-less at the point when we are trying to hot-add a new CPU to it. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-27powerpc/numa: Ensure nodes initialized for hotplugMichael Bringmann1-10/+37
This patch fixes some problems encountered at runtime with configurations that support memory-less nodes, or that hot-add CPUs into nodes that are memoryless during system execution after boot. The problems of interest include: * Nodes known to powerpc to be memoryless at boot, but to have CPUs in them are allowed to be 'possible' and 'online'. Memory allocations for those nodes are taken from another node that does have memory until and if memory is hot-added to the node. * Nodes which have no resources assigned at boot, but which may still be referenced subsequently by affinity or associativity attributes, are kept in the list of 'possible' nodes for powerpc. Hot-add of memory or CPUs to the system can reference these nodes and bring them online instead of redirecting the references to one of the set of nodes known to have memory at boot. Note that this software operates under the context of CPU hotplug. We are not doing memory hotplug in this code, but rather updating the kernel's CPU topology (i.e. arch_update_cpu_topology / numa_update_cpu_topology). We are initializing a node that may be used by CPUs or memory before it can be referenced as invalid by a CPU hotplug operation. CPU hotplug operations are protected by a range of APIs including cpu_maps_update_begin/cpu_maps_update_done, cpus_read/write_lock / cpus_read/write_unlock, device locks, and more. Memory hotplug operations, including try_online_node, are protected by mem_hotplug_begin/mem_hotplug_done, device locks, and more. In the case of CPUs being hot-added to a previously memoryless node, the try_online_node operation occurs wholly within the CPU locks with no overlap. Using HMC hot-add/hot-remove operations, we have been able to add and remove CPUs to any possible node without failures. HMC operations involve a degree self-serialization, though. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-27powerpc/numa: Use ibm,max-associativity-domains to discover possible nodesMichael Bringmann1-3/+34
On powerpc systems which allow 'hot-add' of CPU or memory resources, it may occur that the new resources are to be inserted into nodes that were not used for these resources at bootup. In the kernel, any node that is used must be defined and initialized. These empty nodes may occur when, * Dedicated vs. shared resources. Shared resources require information such as the VPHN hcall for CPU assignment to nodes. Associativity decisions made based on dedicated resource rules, such as associativity properties in the device tree, may vary from decisions made using the values returned by the VPHN hcall. * memoryless nodes at boot. Nodes need to be defined as 'possible' at boot for operation with other code modules. Previously, the powerpc code would limit the set of possible nodes to those which have memory assigned at boot, and were thus online. Subsequent add/remove of CPUs or memory would only work with this subset of possible nodes. * memoryless nodes with CPUs at boot. Due to the previous restriction on nodes, nodes that had CPUs but no memory were being collapsed into other nodes that did have memory at boot. In practice this meant that the node assignment presented by the runtime kernel differed from the affinity and associativity attributes presented by the device tree or VPHN hcalls. Nodes that might be known to the pHyp were not 'possible' in the runtime kernel because they did not have memory at boot. This patch ensures that sufficient nodes are defined to support configuration requirements after boot, as well as at boot. This patch set fixes a couple of problems. * Nodes known to powerpc to be memoryless at boot, but to have CPUs in them are allowed to be 'possible' and 'online'. Memory allocations for those nodes are taken from another node that does have memory until and if memory is hot-added to the node. * Nodes which have no resources assigned at boot, but which may still be referenced subsequently by affinity or associativity attributes, are kept in the list of 'possible' nodes for powerpc. Hot-add of memory or CPUs to the system can reference these nodes and bring them online instead of redirecting to one of the set of nodes that were known to have memory at boot. This patch extracts the value of the lowest domain level (number of allocable resources) from the device tree property "ibm,max-associativity-domains" to use as the maximum number of nodes to setup as possibly available in the system. This new setting will override the instruction: nodes_and(node_possible_map, node_possible_map, node_online_map); presently seen in the function arch/powerpc/mm/numa.c:initmem_init(). If the "ibm,max-associativity-domains" property is not present at boot, no operation will be performed to define or enable additional nodes, or enable the above 'nodes_and()'. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16powerpc/numa: Update numa code use walk_drmem_lmbsNathan Fontenot1-171/+52
Update code in powerpc/numa.c to use the walk_drmem_lmbs() routine instead of parsing the device tree directly. This is in anticipation of introducing a new ibm,dynamic-memory-v2 property with a different format. This will allow the numa code to use a single initialization routine per-LMB irregardless of the device tree format. Additionally, to support additional routines in numa.c that need to look up LMB information, an late_init routine is added to drmem.c to allocate the array of LMB information. This LMB array will provide per-LMB information to separate the LMB data from the device tree format. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16powerpc/numa: Look up associativity array in of_drconf_to_nid_singleNathan Fontenot1-23/+17
Look up the associativity arrays in of_drconf_to_nid_single when deriving the nid for a LMB instead of having it passed in as a parameter. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16powerpc/numa: Look up device node in of_get_usable_memory()Nathan Fontenot1-2/+10
Look up the device node for the usable memory property instead of having it passed in as a parameter. This changes precedes an update in which the calling routines for of_get_usable_memory() will not have the device node pointer to pass in. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16powerpc/numa: Look up device node in of_get_assoc_arrays()Nathan Fontenot1-5/+13
Look up the device node for the associativity array property instead of having it passed in as a parameter. This changes precedes an update in which the calling routines for of_get_assoc_arrays() will not have the device node pointer to pass in. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-16Merge tag 'powerpc-4.15-1' of ↵Linus Torvalds1-8/+55
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "A bit of a small release, I suspect in part due to me travelling for KS. But my backlog of patches to review is smaller than usual, so I think in part folks just didn't send as much this cycle. Non-highlights: - Five fixes for the >128T address space handling, both to fix bugs in our implementation and to bring the semantics exactly into line with x86. Highlights: - Support for a new OPAL call on bare metal machines which gives us a true NMI (ie. is not masked by MSR[EE]=0) for debugging etc. - Support for Power9 DD2 in the CXL driver. - Improvements to machine check handling so that uncorrectable errors can be reported into the generic memory_failure() machinery. - Some fixes and improvements for VPHN, which is used under PowerVM to notify the Linux partition of topology changes. - Plumbing to enable TM (transactional memory) without suspend on some Power9 processors (PPC_FEATURE2_HTM_NO_SUSPEND). - Support for emulating vector loads form cache-inhibited memory, on some Power9 revisions. - Disable the fast-endian switch "syscall" by default (behind a CONFIG), we believe it has never had any users. - A major rework of the API drivers use when initiating and waiting for long running operations performed by OPAL firmware, and changes to the powernv_flash driver to use the new API. - Several fixes for the handling of FP/VMX/VSX while processes are using transactional memory. - Optimisations of TLB range flushes when using the radix MMU on Power9. - Improvements to the VAS facility used to access coprocessors on Power9, and related improvements to the way the NX crypto driver handles requests. - Implementation of PMEM_API and UACCESS_FLUSHCACHE for 64-bit. Thanks to: Alexey Kardashevskiy, Alistair Popple, Allen Pais, Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Balbir Singh, Benjamin Herrenschmidt, Breno Leitao, Christophe Leroy, Christophe Lombard, Cyril Bur, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Guilherme G. Piccoli, Gustavo Romero, Haren Myneni, Joel Stanley, Kamalesh Babulal, Kautuk Consul, Markus Elfring, Masami Hiramatsu, Michael Bringmann, Michael Neuling, Michal Suchanek, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pedro Miraglia Franco de Carvalho, Philippe Bergheaud, Sandipan Das, Seth Forshee, Shriya, Stephen Rothwell, Stewart Smith, Sukadev Bhattiprolu, Tyrel Datwyler, Vaibhav Jain, Vaidyanathan Srinivasan, and William A. Kennington III" * tag 'powerpc-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (151 commits) powerpc/64s: Fix Power9 DD2.0 workarounds by adding DD2.1 feature powerpc/64s: Fix masking of SRR1 bits on instruction fault powerpc/64s: mm_context.addr_limit is only used on hash powerpc/64s/radix: Fix 128TB-512TB virtual address boundary case allocation powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary powerpc/64s/hash: Fix fork() with 512TB process address space powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation powerpc/64s/hash: Fix 512T hint detection to use >= 128T powerpc: Fix DABR match on hash based systems powerpc/signal: Properly handle return value from uprobe_deny_signal() powerpc/fadump: use kstrtoint to handle sysfs store powerpc/lib: Implement UACCESS_FLUSHCACHE API powerpc/lib: Implement PMEM API powerpc/powernv/npu: Don't explicitly flush nmmu tlb powerpc/powernv/npu: Use flush_all_mm() instead of flush_tlb_mm() powerpc/powernv/idle: Round up latency and residency values powerpc/kprobes: refactor kprobe_lookup_name for safer string operations powerpc/kprobes: Blacklist emulate_update_regs() from kprobes powerpc/kprobes: Do not disable interrupts for optprobes and kprobes_on_ftrace powerpc/kprobes: Disable preemption before invoking probe handler for optprobes ...
2017-11-13Merge branch 'timers-core-for-linus' of ↵Linus Torvalds1-7/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Yet another big pile of changes: - More year 2038 work from Arnd slowly reaching the point where we need to think about the syscalls themself. - A new timer function which allows to conditionally (re)arm a timer only when it's either not running or the new expiry time is sooner than the armed expiry time. This allows to use a single timer for multiple timeout requirements w/o caring about the first expiry time at the call site. - A new NMI safe accessor to clock real time for the printk timestamp work. Can be used by tracing, perf as well if required. - A large number of timer setup conversions from Kees which got collected here because either maintainers requested so or they simply got ignored. As Kees pointed out already there are a few trivial merge conflicts and some redundant commits which was unavoidable due to the size of this conversion effort. - Avoid a redundant iteration in the timer wheel softirq processing. - Provide a mechanism to treat RTC implementations depending on their hardware properties, i.e. don't inflict the write at the 0.5 seconds boundary which originates from the PC CMOS RTC to all RTCs. No functional change as drivers need to be updated separately. - The usual small updates to core code clocksource drivers. Nothing really exciting" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (111 commits) timers: Add a function to start/reduce a timer pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday() timer: Prepare to change all DEFINE_TIMER() callbacks netfilter: ipvs: Convert timers to use timer_setup() scsi: qla2xxx: Convert timers to use timer_setup() block/aoe: discover_timer: Convert timers to use timer_setup() ide: Convert timers to use timer_setup() drbd: Convert timers to use timer_setup() mailbox: Convert timers to use timer_setup() crypto: Convert timers to use timer_setup() drivers/pcmcia: omap1: Fix error in automated timer conversion ARM: footbridge: Fix typo in timer conversion drivers/sgi-xp: Convert timers to use timer_setup() drivers/pcmcia: Convert timers to use timer_setup() drivers/memstick: Convert timers to use timer_setup() drivers/macintosh: Convert timers to use timer_setup() hwrng/xgene-rng: Convert timers to use timer_setup() auxdisplay: Convert timers to use timer_setup() sparc/led: Convert timers to use timer_setup() mips: ip22/32: Convert timers to use timer_setup() ...
2017-11-10Merge branch 'fixes' into nextMichael Ellerman1-1/+0
We have some dependencies & conflicts between patches in fixes and things to go in next, both in the radix TLB flush code and the IMC PMU driver. So merge fixes into next.
2017-10-16powerpc/vphn: Fix numa update end-loop bugMichael Bringmann1-2/+8
powerpc/vphn: On Power systems with shared configurations of CPUs and memory, there are some issues with the association of additional CPUs and memory to nodes when hot-adding resources. This patch fixes an end-of-updates processing problem observed occasionally in numa_update_cpu_topology(). Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-10-16powerpc/hotplug: Improve responsiveness of hotplug changeMichael Bringmann1-1/+22
powerpc/hotplug: On Power systems with shared configurations of CPUs and memory, there are some issues with the association of additional CPUs and memory to nodes when hot-adding resources. During hotplug CPU operations, this patch resets the timer on topology update work function to a small value to better ensure that the CPU topology is detected and configured sooner. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-10-16powerpc/vphn: Improve recognition of PRRN/VPHNMichael Bringmann1-4/+4
powerpc/vphn: On Power systems with shared configurations of CPUs and memory, there are some issues with the association of additional CPUs and memory to nodes when hot-adding resources. This patch updates the initialization checks to independently recognize PRRN or VPHN support. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-10-16powerpc/vphn: Update CPU topology when VPHN enabledMichael Bringmann1-1/+21
powerpc/vphn: On Power systems with shared configurations of CPUs and memory, there are some issues with the association of additional CPUs and memory to nodes when hot-adding resources. This patch corrects the currently broken capability to set the topology for shared CPUs in LPARs. At boot time for shared CPU lpars, the topology for each CPU was being set to node zero. Now when numa_update_cpu_topology() is called appropriately, the Virtual Processor Home Node (VPHN) capabilities information provided by the pHyp allows the appropriate node in the shared configuration to be selected for the CPU. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-10-10powerpc: Don't call lockdep_assert_cpus_held() from arch_update_cpu_topology()Thiago Jung Bauermann1-1/+0
It turns out that not all paths calling arch_update_cpu_topology() hold cpu_hotplug_lock, but that's OK because those paths can't race with any concurrent hotplug events. Warnings were reported with the following trace: lockdep_assert_cpus_held arch_update_cpu_topology sched_init_domains sched_init_smp kernel_init_freeable kernel_init ret_from_kernel_thread Which is safe because it's called early in boot when hotplug is not live yet. And also this trace: lockdep_assert_cpus_held arch_update_cpu_topology partition_sched_domains cpuset_update_active_cpus sched_cpu_deactivate cpuhp_invoke_callback cpuhp_down_callbacks cpuhp_thread_fun smpboot_thread_fn kthread ret_from_kernel_thread Which is safe because it's called as part of CPU hotplug, so although we don't hold the CPU hotplug lock, there is another thread driving the CPU hotplug operation which does hold the lock, and there is no race. Thanks to tglx for deciphering it for us. Fixes: 3e401f7a2e51 ("powerpc: Only obtain cpu_hotplug_lock if called by rtasd") Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-10-05timer: Remove init_timer_deferrable() in favor of timer_setup()Kees Cook1-7/+5
This refactors the only users of init_timer_deferrable() to use the new timer_setup() and from_timer(). Removes definition of init_timer_deferrable(). Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: David S. Miller <davem@davemloft.net> # for networking parts Acked-by: Sebastian Reichel <sre@kernel.org> # for drivers/hsi parts Cc: linux-mips@linux-mips.org Cc: Petr Mladek <pmladek@suse.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kalle Valo <kvalo@qca.qualcomm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: linux1394-devel@lists.sourceforge.net Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: linux-s390@vger.kernel.org Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com> Cc: Wim Van Sebroeck <wim@iguana.be> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ursula Braun <ubraun@linux.vnet.ibm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Harish Patil <harish.patil@cavium.com> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Manish Chopra <manish.chopra@cavium.com> Cc: Len Brown <len.brown@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-pm@vger.kernel.org Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Julian Wiedmann <jwi@linux.vnet.ibm.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Gross <mark.gross@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linux-watchdog@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: linux-wireless@vger.kernel.org Cc: Sebastian Reichel <sre@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Stefan Richter <stefanr@s5r6.in-berlin.de> Cc: Michael Reed <mdr@sgi.com> Cc: netdev@vger.kernel.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Link: https://lkml.kernel.org/r/1507159627-127660-6-git-send-email-keescook@chromium.org
2017-06-23powerpc: Only obtain cpu_hotplug_lock if called by rtasdThiago Jung Bauermann1-3/+19
Calling arch_update_cpu_topology from a CPU hotplug state machine callback hits a deadlock because the function tries to get a read lock on cpu_hotplug_lock while the state machine still holds a write lock on it. Since all callers of arch_update_cpu_topology except rtasd already hold cpu_hotplug_lock, this patch changes the function to use stop_machine_cpuslocked and creates a separate function for rtasd which still tries to obtain the lock. Michael Bringmann investigated the bug and provided a detailed analysis of the deadlock on this previous RFC for an alternate solution: Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: John Allen <jallen@linux.vnet.ibm.com> Cc: Michael Bringmann <mwb@linux.vnet.ibm.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1497996510-4032-1-git-send-email-bauerman@linux.vnet.ibm.com Link: https://patchwork.ozlabs.org/patch/771293/
2017-04-11powerpc/mm: Remove reduntant initmem information from logAnshuman Khandual1-7/+0
Generic core VM already prints these information in the log buffer, hence there is no need for a second print. This just removes the second print from arch powerpc NUMA init path. Before the patch: $ dmesg | grep "Initmem" numa: Initmem setup node 0 [mem 0x00000000-0xffffffff] numa: Initmem setup node 1 [mem 0x100000000-0x1ffffffff] numa: Initmem setup node 2 [mem 0x200000000-0x2ffffffff] numa: Initmem setup node 3 [mem 0x300000000-0x3ffffffff] numa: Initmem setup node 4 [mem 0x400000000-0x4ffffffff] numa: Initmem setup node 5 [mem 0x500000000-0x5ffffffff] numa: Initmem setup node 6 [mem 0x600000000-0x6ffffffff] numa: Initmem setup node 7 [mem 0x700000000-0x7ffffffff] Initmem setup node 0 [mem 0x0000000000000000-0x00000000ffffffff] Initmem setup node 1 [mem 0x0000000100000000-0x00000001ffffffff] Initmem setup node 2 [mem 0x0000000200000000-0x00000002ffffffff] Initmem setup node 3 [mem 0x0000000300000000-0x00000003ffffffff] Initmem setup node 4 [mem 0x0000000400000000-0x00000004ffffffff] Initmem setup node 5 [mem 0x0000000500000000-0x00000005ffffffff] Initmem setup node 6 [mem 0x0000000600000000-0x00000006ffffffff] Initmem setup node 7 [mem 0x0000000700000000-0x00000007ffffffff] After the patch just the latter set is printed. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-10powerpc: Drop GPL from of_node_to_nid() export to match other archesShailendra Singh1-1/+1
The generic implementation of of_node_to_nid() is EXPORT_SYMBOL, added in commit 298535c00a2c ("of, numa: Add NUMA of binding implementation."). The powerpc implementation added in commit 953039c8df7b ("[PATCH] powerpc: Allow devices to register with numa topology") is EXPORT_SYMBOL_GPL. This creates an inconsistency for of_node_to_nid() callers across architectures. Update the powerpc implementation to be exported consistently with the generic implementation. Signed-off-by: Shailendra Singh <shailendras@nvidia.com> Reviewed-by: Andy Ritger <aritger@nvidia.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-30powerpc/mm: Allow memory hotplug into an offline nodeReza Arbab1-1/+1
Relax the check preventing us from hotplugging into an offline node. This limitation was added in commit 482ec7c403d2 ("[PATCH] powerpc numa: Support sparse online node map") to prevent adding resources to an uninitialized node. These days, there is no harm in doing so. The addition will actually cause the node to be initialized and onlined; add_memory_resource() calls hotadd_new_pgdat() (if necessary) and node_set_online(). Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-30powerpc/mm: Simplify loop control in parse_numa_properties()Reza Arbab1-8/+3
The flow of the main loop in parse_numa_properties() is overly complicated. Simplify it to be less confusing and easier to read. No functional change. The end of the main loop in parse_numa_properties() looks like this: for_each_node_by_type(...) { ... if (!condition) { if (--ranges) goto new_range; else continue; } statement(); if (--ranges) goto new_range; /* else * continue; <- implicit, this is the end of the loop */ } The only effect of !condition is to skip execution of statement(). This can be rewritten in a simpler way: for_each_node_by_type(...) { ... if (condition) statement(); if (--ranges) goto new_range; } Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-25cpu/hotplug: Cleanup state namesThomas Gleixner1-1/+1
When the state names got added a script was used to add the extra argument to the calls. The script basically converted the state constant to a string, but the cleanup to convert these strings into meaningful ones did not happen. Replace all the useless strings with 'subsys/xxx/yyy:state' strings which are used in all the other places already. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Siewior <bigeasy@linutronix.de> Link: http://lkml.kernel.org/r/20161221192112.085444152@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-12powerpc/mm: allow memory hotplug into a memoryless nodeReza Arbab1-12/+1
Patch series "enable movable nodes on non-x86 configs", v7. This patchset allows more configs to make use of movable nodes. When CONFIG_MOVABLE_NODE is selected, there are two ways to introduce such nodes into the system: 1. Discover movable nodes at boot. Currently this is only possible on x86, but we will enable configs supporting fdt to do the same. 2. Hotplug and online all of a node's memory using online_movable. This is already possible on any config supporting memory hotplug, not just x86, but the Kconfig doesn't say so. We will fix that. We'll also remove some cruft on power which would prevent (2). This patch (of 5): Remove the check which prevents us from hotplugging into an empty node. The original commit b226e4621245 ("[PATCH] powerpc: don't add memory to empty node/zone"), states that this was intended to be a temporary measure. It is a workaround for an oops which no longer occurs. Link: http://lkml.kernel.org/r/1479160961-25840-2-git-send-email-arbab@linux.vnet.ibm.com Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alistair Popple <apopple@au1.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Frank Rowand <frowand.list@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Stewart Smith <stewart@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-19powerpc: Fix numa topology console printAneesh Kumar K.V1-5/+5
With recent update to printk, we get console output like below: [ 0.550639] Brought up 160 CPUs [ 0.550718] Node 0 CPUs: [ 0.550721] 0 [ 0.550754] -39 [ 0.550794] Node 1 CPUs: [ 0.550798] 40 [ 0.550817] -79 [ 0.550856] Node 16 CPUs: [ 0.550860] 80 [ 0.550880] -119 [ 0.550917] Node 17 CPUs: [ 0.550923] 120 [ 0.550942] -159 Fix this by properly using pr_cont(), ie. KERN_CONT. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-10-19powerpc/mm: Drop dump_numa_memory_topology()Michael Ellerman1-36/+0
At boot we dump the NUMA memory topology in dump_numa_memory_topology(), at KERN_DEBUG level, resulting in output like: Node 0 Memory: 0x0-0x100000000 Node 1 Memory: 0x100000000-0x200000000 Which is nice enough, but immediately after that we iterate over each node and call setup_node_data(), which also prints out the node ranges, at KERN_INFO, giving eg: numa: Initmem setup node 0 [mem 0x00000000-0xffffffff] numa: Initmem setup node 1 [mem 0x100000000-0x1ffffffff] Additionally dump_numa_memory_topology() does not use KERN_CONT correctly, resulting in split output lines on recent kernels. So drop dump_numa_memory_topology() as superfluous chatter. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-30Merge tag 'powerpc-4.8-1' of ↵Linus Torvalds1-10/+26
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights: - PowerNV PCI hotplug support. - Lots more Power9 support. - eBPF JIT support on ppc64le. - Lots of cxl updates. - Boot code consolidation. Bug fixes: - Fix spin_unlock_wait() from Boqun Feng - Fix stack pointer corruption in __tm_recheckpoint() from Michael Neuling - Fix multiple bugs in memory_hotplug_max() from Bharata B Rao - mm: Ensure "special" zones are empty from Oliver O'Halloran - ftrace: Separate the heuristics for checking call sites from Michael Ellerman - modules: Never restore r2 for a mprofile-kernel style mcount() call from Michael Ellerman - Fix endianness when reading TCEs from Alexey Kardashevskiy - start rtasd before PCI probing from Greg Kurz - PCI: rpaphp: Fix slot registration for multiple slots under a PHB from Tyrel Datwyler - powerpc/mm: Add memory barrier in __hugepte_alloc() from Sukadev Bhattiprolu Cleanups & fixes: - Drop support for MPIC in pseries from Rashmica Gupta - Define and use PPC64_ELF_ABI_v2/v1 from Michael Ellerman - Remove unused symbols in asm-offsets.c from Rashmica Gupta - Fix SRIOV not building without EEH enabled from Russell Currey - Remove kretprobe_trampoline_holder from Thiago Jung Bauermann - Reduce log level of PCI I/O space warning from Benjamin Herrenschmidt - Add array bounds checking to crash_shutdown_handlers from Suraj Jitindar Singh - Avoid -maltivec when using clang integrated assembler from Anton Blanchard - Fix array overrun in ppc_rtas() syscall from Andrew Donnellan - Fix error return value in cmm_mem_going_offline() from Rasmus Villemoes - export cpu_to_core_id() from Mauricio Faria de Oliveira - Remove old symbols from defconfigs from Andrew Donnellan - Update obsolete comments in setup_32.c about entry conditions from Benjamin Herrenschmidt - Add comment explaining the purpose of setup_kdump_trampoline() from Benjamin Herrenschmidt - Merge the RELOCATABLE config entries for ppc32 and ppc64 from Kevin Hao - Remove RELOCATABLE_PPC32 from Kevin Hao - Fix .long's in tlb-radix.c to more meaningful from Balbir Singh Minor cleanups & fixes: - Andrew Donnellan, Anna-Maria Gleixner, Anton Blanchard, Benjamin Herrenschmidt, Bharata B Rao, Christophe Leroy, Colin Ian King, Geliang Tang, Greg Kurz, Madhavan Srinivasan, Michael Ellerman, Michael Ellerman, Stephen Rothwell, Stewart Smith. Freescale updates from Scott: - "Highlights include more 8xx optimizations, device tree updates, and MVME7100 support." PowerNV PCI hotplug from Gavin Shan: - PCI: Add pcibios_setup_bridge() - Override pcibios_setup_bridge() - Remove PCI_RESET_DELAY_US - Move pnv_pci_ioda_setup_opal_tce_kill() around - Increase PE# capacity - Allocate PE# in reverse order - Create PEs in pcibios_setup_bridge() - Setup PE for root bus - Extend PCI bridge resources - Make pnv_ioda_deconfigure_pe() visible - Dynamically release PE - Update bridge windows on PCI plug - Delay populating pdn - Support PCI slot ID - Use PCI slot reset infrastructure - Introduce pnv_pci_get_slot_id() - Functions to get/set PCI slot state - PCI/hotplug: PowerPC PowerNV PCI hotplug driver - Print correct PHB type names Power9 idle support from Shreyas B. Prabhu: - set power_save func after the idle states are initialized - Use PNV_THREAD_WINKLE macro while requesting for winkle - make hypervisor state restore a function - Rename idle_power7.S to idle_book3s.S - Rename reusable idle functions to hardware agnostic names - Make pnv_powersave_common more generic - abstraction for saving SPRs before entering deep idle states - Add platform support for stop instruction - cpuidle/powernv: Use CPUIDLE_STATE_MAX instead of MAX_POWERNV_IDLE_STATES - cpuidle/powernv: cleanup cpuidle-powernv.c - cpuidle/powernv: Add support for POWER ISA v3 idle states - Use deepest stop state when cpu is offlined Power9 PMU from Madhavan Srinivasan: - factor out power8 pmu macros and defines - factor out power8 pmu functions - factor out power8 __init_pmu code - Add power9 event list macros for generic and cache events - Power9 PMU support - Export Power9 generic and cache events to sysfs Power9 preliminary interrupt & PCI support from Benjamin Herrenschmidt: - Add XICS emulation APIs - Move a few exception common handlers to make room - Add support for HV virtualization interrupts - Add mechanism to force a replay of interrupts - Add ICP OPAL backend - Discover IODA3 PHBs - pci: Remove obsolete SW invalidate - opal: Add real mode call wrappers - Rename TCE invalidation calls - Remove SWINV constants and obsolete TCE code - Rework accessing the TCE invalidate register - Fallback to OPAL for TCE invalidations - Use the device-tree to get available range of M64's - Check status of a PHB before using it - pci: Don't try to allocate resources that will be reassigned Other Power9: - Send SIGBUS on unaligned copy and paste from Chris Smart - Large Decrementer support from Oliver O'Halloran - Load Monitor Register Support from Jack Miller Performance improvements from Anton Blanchard: - Avoid load hit store in __giveup_fpu() and __giveup_altivec() - Avoid load hit store in setup_sigcontext() - Remove assembly versions of strcpy, strcat, strlen and strcmp - Align hot loops of some string functions eBPF JIT from Naveen N. Rao: - Fix/enhance 32-bit Load Immediate implementation - Optimize 64-bit Immediate loads - Introduce rotate immediate instructions - A few cleanups - Isolate classic BPF JIT specifics into a separate header - Implement JIT compiler for extended BPF Operator Panel driver from Suraj Jitindar Singh: - devicetree/bindings: Add binding for operator panel on FSP machines - Add inline function to get rc from an ASYNC_COMP opal_msg - Add driver for operator panel on FSP machines Sparse fixes from Daniel Axtens: - make some things static - Introduce asm-prototypes.h - Include headers containing prototypes - Use #ifdef __BIG_ENDIAN__ #else for REG_BYTE - kvm: Clarify __user annotations - Pass endianness to sparse - Make ppc_md.{halt, restart} __noreturn MM fixes & cleanups from Aneesh Kumar K.V: - radix: Update LPCR HR bit as per ISA - use _raw variant of page table accessors - Compile out radix related functions if RADIX_MMU is disabled - Clear top 16 bits of va only on older cpus - Print formation regarding the the MMU mode - hash: Update SDR1 size encoding as documented in ISA 3.0 - radix: Update PID switch sequence - radix: Update machine call back to support new HCALL. - radix: Add LPID based tlb flush helpers - radix: Add a kernel command line to disable radix - Cleanup LPCR defines Boot code consolidation from Benjamin Herrenschmidt: - Move epapr_paravirt_early_init() to early_init_devtree() - cell: Don't use flat device-tree after boot - ge_imp3a: Don't use the flat device-tree after boot - mpc85xx_ds: Don't use the flat device-tree after boot - mpc85xx_rdb: Don't use the flat device-tree after boot - Don't test for machine type in rtas_initialize() - Don't test for machine type in smp_setup_cpu_maps() - dt: Add of_device_compatible_match() - Factor do_feature_fixup calls - Move 64-bit feature fixup earlier - Move 64-bit memory reserves to setup_arch() - Use a cachable DART - Move FW feature probing out of pseries probe() - Put exception configuration in a common place - Remove early allocation of the SMU command buffer - Move MMU backend selection out of platform code - pasemi: Remove IOBMAP allocation from platform probe() - mm/hash: Don't use machine_is() early during boot - Don't test for machine type to detect HEA special case - pmac: Remove spurrious machine type test - Move hash table ops to a separate structure - Ensure that ppc_md is empty before probing for machine type - Move 64-bit probe_machine() to later in the boot process - Move 32-bit probe() machine to later in the boot process - Get rid of ppc_md.init_early() - Move the boot time info banner to a separate function - Move setting of {i,d}cache_bsize to initialize_cache_info() - Move the content of setup_system() to setup_arch() - Move cache info inits to a separate function - Re-order the call to smp_setup_cpu_maps() - Re-order setup_panic() - Make a few boot functions __init - Merge 32-bit and 64-bit setup_arch() Other new features: - tty/hvc: Use IRQF_SHARED for OPAL hvc consoles from Sam Mendoza-Jonas - tty/hvc: Use opal irqchip interface if available from Sam Mendoza-Jonas - powerpc: Add module autoloading based on CPU features from Alastair D'Silva - crypto: vmx - Convert to CPU feature based module autoloading from Alastair D'Silva - Wake up kopald polling thread before waiting for events from Benjamin Herrenschmidt - xmon: Dump ISA 2.06 SPRs from Michael Ellerman - xmon: Dump ISA 2.07 SPRs from Michael Ellerman - Add a parameter to disable 1TB segs from Oliver O'Halloran - powerpc/boot: Add OPAL console to epapr wrappers from Oliver O'Halloran - Assign fixed PHB number based on device-tree properties from Guilherme G. Piccoli - pseries: Add pseries hotplug workqueue from John Allen - pseries: Add support for hotplug interrupt source from John Allen - pseries: Use kernel hotplug queue for PowerVM hotplug events from John Allen - pseries: Move property cloning into its own routine from Nathan Fontenot - pseries: Dynamic add entires to associativity lookup array from Nathan Fontenot - pseries: Auto-online hotplugged memory from Nathan Fontenot - pseries: Remove call to memblock_add() from Nathan Fontenot cxl: - Add set and get private data to context struct from Michael Neuling - make base more explicitly non-modular from Paul Gortmaker - Use for_each_compatible_node() macro from Wei Yongjun - Frederic Barrat - Abstract the differences between the PSL and XSL - Make vPHB device node match adapter's - Philippe Bergheaud - Add mechanism for delivering AFU driver specific events - Ignore CAPI adapters misplaced in switched slots - Refine slice error debug messages - Andrew Donnellan - static-ify variables to fix sparse warnings - PCI/hotplug: pnv_php: export symbols and move struct types needed by cxl - PCI/hotplug: pnv_php: handle OPAL_PCI_SLOT_OFFLINE power state - Add cxl_check_and_switch_mode() API to switch bi-modal cards - remove dead Kconfig options - fix potential NULL dereference in free_adapter() - Ian Munsie - Update process element after allocating interrupts - Add support for CAPP DMA mode - Fix allowing bogus AFU descriptors with 0 maximum processes - Fix allocating a minimum of 2 pages for the SPA - Fix bug where AFU disable operation had no effect - Workaround XSL bug that does not clear the RA bit after a reset - Fix NULL pointer dereference on kernel contexts with no AFU interrupts - powerpc/powernv: Split cxl code out into a separate file - Add cxl_slot_is_supported API - Enable bus mastering for devices using CAPP DMA mode - Move cxl_afu_get / cxl_afu_put to base - Allow a default context to be associated with an external pci_dev - Do not create vPHB if there are no AFU configuration records - powerpc/powernv: Add support for the cxl kernel api on the real phb - Add support for using the kernel API with a real PHB - Add kernel APIs to get & set the max irqs per context - Add preliminary workaround for CX4 interrupt limitation - Add support for interrupts on the Mellanox CX4 - Workaround PE=0 hardware limitation in Mellanox CX4 - powerpc/powernv: Fix pci-cxl.c build when CONFIG_MODULES=n selftests: - Test unaligned copy and paste from Chris Smart - Load Monitor Register Tests from Jack Miller - Cyril Bur - exec() with suspended transaction - Use signed long to read perf_event_paranoid - Fix usage message in context_switch - Fix generation of vector instructions/types in context_switch - Michael Ellerman - Use "Delta" rather than "Error" in normal output - Import Anton's mmap & futex micro benchmarks - Add a test for PROT_SAO" * tag 'powerpc-4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (263 commits) powerpc/mm: Parenthesise IS_ENABLED() in if condition tty/hvc: Use opal irqchip interface if available tty/hvc: Use IRQF_SHARED for OPAL hvc consoles selftests/powerpc: exec() with suspended transaction powerpc: Improve comment explaining why we modify VRSAVE powerpc/mm: Drop unused externs for hpte_init_beat[_v3]() powerpc/mm: Rename hpte_init_lpar() and move the fallback to a header powerpc/mm: Fix build break when PPC_NATIVE=n crypto: vmx - Convert to CPU feature based module autoloading powerpc: Add module autoloading based on CPU features powerpc/powernv/ioda: Fix endianness when reading TCEs powerpc/mm: Add memory barrier in __hugepte_alloc() powerpc/modules: Never restore r2 for a mprofile-kernel style mcount() call powerpc/ftrace: Separate the heuristics for checking call sites powerpc: Merge 32-bit and 64-bit setup_arch() powerpc/64: Make a few boot functions __init powerpc: Re-order setup_panic() powerpc: Re-order the call to smp_setup_cpu_maps() powerpc/32: Move cache info inits to a separate function powerpc/64: Move the content of setup_system() to setup_arch() ...
2016-07-22powerpc/numa: Convert to hotplug state machineSebastian Andrzej Siewior1-29/+19
Install the callbacks via the state machine. On the boot cpu the callback is invoked manually because cpuhp is not up yet and everything must be preinitialized before additional CPUs are up. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christophe Jaillet <christophe.jaillet@wanadoo.fr> Cc: Anton Blanchard <anton@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160718140727.GA13132@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-06-14powerpc/numa: Fix multiple bugs in memory_hotplug_max()Bharata B Rao1-1/+17
memory_hotplug_max() uses hot_add_drconf_memory_max() to get maxmimum addressable memory by referring to ibm,dyanamic-memory property. There are three problems with the current approach: 1 hot_add_drconf_memory_max() assumes that ibm,dynamic-memory includes all the LMBs of the guest, but that is not true for PowerKVM which populates only DR LMBs (LMBs that can be hotplugged/removed) in that property. 2 hot_add_drconf_memory_max() multiplies lmb-size with lmb-count to arrive at the max possible address. Since ibm,dynamic-memory doesn't include RMA LMBs, the address thus obtained will be less than the actual max address. For example, if max possible memory size is 32G, with lmb-size of 256MB there can be 127 LMBs in ibm,dynamic-memory (1 LMB for RMA which won't be present here). hot_add_drconf_memory_max() would then return the max addressable memory as 127 * 256MB = 31.75GB, the max address should have been 32G which is what ibm,lrdr-capacity shows. 3 In PowerKVM, there can be a gap between the end of boot time RAM and beginning of hotplug RAM area. So just multiplying lmb-count with lmb-size will not provide the correct max possible address for PowerKVM. This patch fixes 1 by using ibm,lrdr-capacity property to return the max addressable memory whenever the property is present. Then it fixes 2 & 3 by fetching the address of the last LMB in ibm,dynamic-memory property. Fixes: cd34206e949b ("powerpc: Add memory_hotplug_max()") Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-14powerpc/numa: Fix whitespace in hot_add_drconf_memory_max()Bharata B Rao1-10/+10
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-11-05Merge tag 'powerpc-4.4-1' of ↵Linus Torvalds1-4/+1
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Kconfig: remove BE-only platforms from LE kernel build from Boqun Feng - Refresh ps3_defconfig from Geoff Levand - Emit GNU & SysV hashes for the vdso from Michael Ellerman - Define an enum for the bolted SLB indexes from Anshuman Khandual - Use a local to avoid multiple calls to get_slb_shadow() from Michael Ellerman - Add gettimeofday() benchmark from Michael Neuling - Avoid link stack corruption in __get_datapage() from Michael Neuling - Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar K.V - Add ppc64le_defconfig from Michael Ellerman - pseries: extract of_helpers module from Andy Shevchenko - Correct string length in pseries_of_derive_parent() from Nathan Fontenot - Free the MSI bitmap if it was slab allocated from Denis Kirjanov - Shorten irq_chip name for the SIU from Christophe Leroy - Wait 1s for secondaries to enter OPAL during kexec from Samuel Mendoza-Jonas - Fix _ALIGN_* errors due to type difference, from Aneesh Kumar K.V - powerpc/pseries/hvcserver: don't memset pi_buff if it is null from Colin Ian King - Disable hugepd for 64K page size, from Aneesh Kumar K.V - Differentiate between hugetlb and THP during page walk from Aneesh Kumar K.V - Make PCI non-optional for pseries from Michael Ellerman - Individual System V IPC system calls from Sam bobroff - Add selftest of unmuxed IPC calls from Michael Ellerman - discard .exit.data at runtime from Stephen Rothwell - Delete old orphaned PrPMC 280/2800 DTS and boot file, from Paul Gortmaker - Use of_get_next_parent to simplify code from Christophe Jaillet - Paginate some xmon output from Sam bobroff - Add some more elements to the xmon PACA dump from Michael Ellerman - Allow the tm-syscall selftest to build with old headers from Michael Ellerman - Run EBB selftests only on POWER8 from Denis Kirjanov - Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael Ellerman - Avoid reference to potentially freed memory in prom.c from Christophe Jaillet - Quieten boot wrapper output with run_cmd from Geoff Levand - EEH fixes and cleanups from Gavin Shan - Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan - Use of_get_next_parent() in of_get_ibm_chip_id() from Michael Ellerman - Fix section mismatch warning in msi_bitmap_alloc() from Denis Kirjanov - Fix ps3-lpm white space from Rudhresh Kumar J - Fix ps3-vuart null dereference from Colin King - nvram: Add missing kfree in error path from Christophe Jaillet - nvram: Fix function name in some errors messages, from Christophe Jaillet - drivers/macintosh: adb: fix misleading Kconfig help text from Aaro Koskinen - agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov - cxl: Free virtual PHB when removing from Andrew Donnellan - scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from Michael Ellerman - scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building with O= from Michael Ellerman - Freescale updates from Scott: Highlights include 64-bit book3e kexec/kdump support, a rework of the qoriq clock driver, device tree changes including qoriq fman nodes, support for a new 85xx board, and some fixes. - MPC5xxx updates from Anatolij: Highlights include a driver for MPC512x LocalPlus Bus FIFO with its device tree binding documentation, mpc512x device tree updates and some minor fixes. * tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (106 commits) powerpc/msi: Fix section mismatch warning in msi_bitmap_alloc() powerpc/prom: Use of_get_next_parent() in of_get_ibm_chip_id() powerpc/pseries: Correct string length in pseries_of_derive_parent() powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry powerpc/mpc85xx: Add FSL QorIQ DPAA FMan support to the SoC device tree(s) powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan powerpc/fsl: Add #clock-cells and clockgen label to clockgen nodes powerpc: handle error case in cpm_muram_alloc() powerpc: mpic: use IRQCHIP_SKIP_SET_WAKE instead of redundant mpic_irq_set_wake powerpc/book3e-64: Enable kexec powerpc/book3e-64/kexec: Set "r4 = 0" when entering spinloop powerpc/booke: Only use VIRT_PHYS_OFFSET on booke32 powerpc/book3e-64/kexec: Enable SMP release powerpc/book3e-64/kexec: create an identity TLB mapping powerpc/book3e-64: Don't limit paca to 256 MiB powerpc/book3e/kdump: Enable crash_kexec_wait_realmode powerpc/book3e: support CONFIG_RELOCATABLE powerpc/booke64: Fix args to copy_and_flush powerpc/book3e-64: rename interrupt_end_book3e with __end_interrupts powerpc/e6500: kexec: Handle hardware threads ...
2015-11-05arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodesRaghavendra K T1-1/+1
With the setup_nr_nodes(), we have already initialized node_possible_map. So it is safe to use for_each_node here. There are many places in the kernel that use hardcoded 'for' loop with nr_node_ids, because all other architectures have numa nodes populated serially. That should be reason we had maintained the same for powerpc. But, since sparse numa node ids possible on powerpc, we unnecessarily allocate memory for non existent numa nodes. For e.g., on a system with 0,1,16,17 as numa nodes nr_node_ids=18 and we allocate memory for nodes 2-14. This patch we allocate memory for only existing numa nodes. The patch is boot tested on a 4 node tuleta, confirming with printks that it works as expected. Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anton Blanchard <anton@samba.org> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Greg Kurz <gkurz@linux.vnet.ibm.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-15powerpc/numa: Use of_get_next_parent to simplify codeChristophe Jaillet1-4/+1
of_get_next_parent can be used to simplify the while() loop and avoid the need of a temp variable. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-18powerpc/numa: initialize distance lookup table from drconf pathNikunj A Dadhania1-3/+13
In some situations, a NUMA guest that supports ibm,dynamic-memory-reconfiguration node will end up having flat NUMA distances between nodes. This is because of two problems in the current code. 1) Different representations of associativity lists. There is an assumption about the associativity list in initialize_distance_lookup_table(). Associativity list has two forms: a) [cpu,memory]@x/ibm,associativity has following format: <N> <N integers> b) ibm,dynamic-reconfiguration-memory/ibm,associativity-lookup-arrays <M> <N> <M associativity lists each having N integers> M = the number of associativity lists N = the number of entries per associativity list Fix initialize_distance_lookup_table() so that it does not assume "case a". And update the caller to skip the length field before sending the associativity list. 2) Distance table not getting updated from drconf path. Node distance table will not get initialized in certain cases as ibm,dynamic-reconfiguration-memory path does not initialize the lookup table. Call initialize_distance_lookup_table() from drconf path with appropriate associativity list. Reported-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> Acked-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-03-23powerpc/numa: Reset node_possible_map to only node_online_mapNishanth Aravamudan1-0/+7
Raghu noticed an issue with excessive memory allocation on power with a simple cgroup test, specifically, in mem_cgroup_css_alloc -> for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup directories). The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes possible), which defines node_possible_map, which in turn defines the value of nr_node_ids in setup_nr_node_ids and the iteration of for_each_node. In practice, we never see a system with 256 NUMA nodes, and in fact, we do not support node hotplug on power in the first place, so the nodes that are online when we come up are the nodes that will be present for the lifetime of this kernel. So let's, at least, drop the NUMA possible map down to the online map at runtime. This is similar to what x86 does in its initialization routines. mem_cgroup_css_alloc should also be fixed to only iterate over memory-populated nodes and handle hotplug, but that is a separate change. Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Anton Blanchard <anton@samba.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-03-18powerpc/vphn: move VPHN parsing logic to a separate fileGreg Kurz1-58/+3
The goal behind this patch is to be able to write userland tests for the VPHN parsing code. Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-03-18powerpc/vphn: move endianness fixing to vphn_unpack_associativity()Greg Kurz1-4/+6
The first argument to vphn_unpack_associativity() is a const long *, but the parsing code expects __be64 values actually. Let's move the endian fixing down for consistency. Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-03-18powerpc/vphn: clarify the H_HOME_NODE_ASSOCIATIVITY APIGreg Kurz1-2/+6
The number of values returned by the H_HOME_NODE_ASSOCIATIVITY h_call deserves to be explicitly defined, for a better understanding of the code. Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-11Merge tag 'powerpc-3.19-1' of ↵Linus Torvalds1-189/+35
git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux Pull powerpc updates from Michael Ellerman: "Some nice cleanups like removing bootmem, and removal of __get_cpu_var(). There is one patch to mm/gup.c. This is the generic GUP implementation, but is only used by us and arm(64). We have an ack from Steve Capper, and although we didn't get an ack from Andrew he told us to take the patch through the powerpc tree. There's one cxl patch. This is in drivers/misc, but Greg said he was happy for us to manage fixes for it. There is an infrastructure patch to support an IPMI driver for OPAL. There is also an RTC driver for OPAL. We weren't able to get any response from the RTC maintainer, Alessandro Zummo, so in the end we just merged the driver. The usual batch of Freescale updates from Scott" * tag 'powerpc-3.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (101 commits) powerpc/powernv: Return to cpu offline loop when finished in KVM guest powerpc/book3s: Fix partial invalidation of TLBs in MCE code. powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault powerpc/xmon: Cleanup the breakpoint flags powerpc/xmon: Enable HW instruction breakpoint on POWER8 powerpc/mm/thp: Use tlbiel if possible powerpc/mm/thp: Remove code duplication powerpc/mm/hugetlb: Sanity check gigantic hugepage count powerpc/oprofile: Disable pagefaults during user stack read powerpc/mm: Check for matching hpte without taking hpte lock powerpc: Drop useless warning in eeh_init() powerpc/powernv: Cleanup unused MCE definitions/declarations. powerpc/eeh: Dump PHB diag-data early powerpc/eeh: Recover EEH error on ownership change for BCM5719 powerpc/eeh: Set EEH_PE_RESET on PE reset powerpc/eeh: Refactor eeh_reset_pe() powerpc: Remove more traces of bootmem powerpc/pseries: Initialise nvram_pstore_info's buf_lock cxl: Name interrupts in /proc/interrupt cxl: Return error to PSL if IRQ demultiplexing fails & print clearer warning ...
2014-11-24of/reconfig: Always use the same structure for notifiersGrant Likely1-2/+1
The OF_RECONFIG notifier callback uses a different structure depending on whether it is a node change or a property change. This is silly, and not very safe. Rework the code to use the same data structure regardless of the type of notifier. Signed-off-by: Grant Likely <grant.likely@linaro.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Pantelis Antoniou <pantelis.antoniou@konsulko.com> Cc: <linuxppc-dev@lists.ozlabs.org>
2014-11-10powerpc: Move sparse_init() into initmem_initAnton Blanchard1-0/+2
We did part of sparse initialisation in setup_arch and part in initmem_init. Put them together. Signed-off-by: Anton Blanchard <anton@samba.org> Tested-by: Emil Medve <Emilian.Medve@Freescale.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-11-10powerpc: Remove bootmem allocatorAnton Blanchard1-190/+34
At the moment we transition from the memblock alloctor to the bootmem allocator. Gitting rid of the bootmem allocator removes a bunch of complicated code (most of which I owe the dubious honour of being responsible for writing). Signed-off-by: Anton Blanchard <anton@samba.org> Tested-by: Emil Medve <Emilian.Medve@Freescale.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-10-29powerpc/numa: ensure per-cpu NUMA mappings are correct on topology updateNishanth Aravamudan1-1/+4
We received a report of warning in kernel/sched/core.c where the sched group was NULL on an LPAR after a topology update. This seems to occur because after the topology update has moved the CPUs, cpu_to_node is returning the old value still, which ends up breaking the consistency of the NUMA topology in the per-cpu maps. Ensure that we update the per-cpu fields when we re-map CPUs. Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-10-29powerpc/numa: use cached value of update->cpu in update_cpu_topologyNishanth Aravamudan1-2/+2
There isn't any need to keep referring to update->cpu, as we've already checked cpu == update->cpu at this point. Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-10-16powerpc/vphn: NUMA node code expects big-endianGreg Kurz1-0/+3
The associativity domain numbers are obtained from the hypervisor through registers and written into memory by the guest: the packed array passed to vphn_unpack_associativity() is then native-endian, unlike what was assumed in the following commit: commit b08a2a12e44eaec5024b2b969f4fcb98169d1ca3 Author: Alistair Popple <alistair@popple.id.au> Date: Wed Aug 7 02:01:44 2013 +1000 powerpc: Make NUMA device node code endian safe This issue fills the topology with bogus data and makes it unusable. It may lead to severe performance breakdowns. We should ideally patch the vphn_unpack_associativity() function to do the 64-bit loads, but this requires some more brain storming. In the meantime, let's go for a suboptimal and temporary bug fix: this patch converts each 64-bit value of the packed array to big endian, as expected by the current parsing code in vphn_unpack_associativity(). Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-10-13powerpc/numa: Add ability to disable and debug topology updatesNishanth Aravamudan1-1/+34
We have hit a few customer issues with the topology update code (VPHN and PRRN). It would be nice to be able to debug the notifications coming from the hypervisor in both cases to the LPAR, as well as to disable responding to the notifications at boot-time, to narrow down the source of the problems. Add a basic level of such functionality, similar to the numa= command-line parameter. We already have a toggle in /proc/powerpc/topology_updates that allows run-time enabling/disabling, so the updates can be started at run-time if desired. But the bugs we've run into have occured during boot or very shortly after coming to login, and have resulted in a broken NUMA topology. Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-10-13powerpc/numa: check error return from proc_createNishanth Aravamudan1-1/+2
proc_create can fail, we should check the return value and pass up the failure. Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-10-04Merge branch 'next' of ↵Michael Ellerman1-8/+0
git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux.git Freescale updates from Scott (27 commits): "Highlights include DMA32 zone support (SATA, USB, etc now works on 64-bit FSL kernels), MSI changes, 8xx optimizations and cleanup, t104x board support, and PrPMC PCI enumeration."
2014-09-25powerpc: some changes in numa_setup_cpu()Li Zhong1-6/+8
this patches changes some error handling logics in numa_setup_cpu(), when cpu node is not found, so: if the cpu is possible, but not present, -1 is kept in numa_cpu_lookup_table, so later, if the cpu is added, we could set correct numa information for it. if the cpu is present, then we set the first online node to numa_cpu_lookup_table instead of 0 ( in case 0 might not be an online node? ) Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-09-25powerpc: Only set numa node information for present cpus at boottimeLi Zhong1-1/+1
As Nish suggested, it makes more sense to init the numa node informatiion for present cpus at boottime, which could also avoid WARN_ON(1) in numa_setup_cpu(). With this change, we also need to change the smp_prepare_cpus() to set up numa information only on present cpus. For those possible, but not present cpus, their numa information will be set up after they are started, as the original code did before commit 2fabf084b6ad. Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Tested-by: Cyril Bur <cyril.bur@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-09-25powerpc: Fix warning reported by verify_cpu_node_mapping()Li Zhong1-2/+1
With commit 2fabf084b6ad ("powerpc: reorder per-cpu NUMA information's initialization"), during boottime, cpu_numa_callback() is called earlier(before their online) for each cpu, and verify_cpu_node_mapping() uses cpu_to_node() to check whether siblings are in the same node. It skips the checking for siblings that are not online yet. So the only check done here is for the bootcpu, which is online at that time. But the per-cpu numa_node cpu_to_node() uses hasn't been set up yet (which will be set up in smp_prepare_cpus()). So I saw something like following reported: [ 0.000000] CPU thread siblings 1/2/3 and 0 don't belong to the same node! As we don't actually do the checking during this early stage, so maybe we could directly call numa_setup_cpu() in do_init_bootmem(). Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-09-19powerpc/mm: Use common paging_init() for NUMAScott Wood1-8/+0
Commit 1c98025c6c95bc057a25e2c6596de23288c68160 "powerpc: Dynamic DMA zone limits" updated how zones are created in paging_init(), but missed the NUMA version of paging_init(). This was noticed via a linker error, since dma_pfn_limit_to_zone() was, like the non-NUMA paging_init(), limited by #ifndef CONFIG_NEED_MULTIPLE_NODES. It turns out that the NUMA paging_init() was not actually doing anything different from the standard paging_init(), other than a couple debug prints, a couple 32-bit-only ifdef sections, and a call to mark_nonram_nosave(). It's not clear whether mark_nonram_nosave() is inherently wrong to do for NUMA, or just not useful on targets that have NUMA, but for now I'm preserving the existing behavior. Fixes: 1c98025c6c9 "powerpc: Dynamic DMA zone limits" Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Scott Wood <scottwood@freescale.com>
2014-08-13powerpc: reorder per-cpu NUMA information's initializationNishanth Aravamudan1-3/+10
There is an issue currently where NUMA information is used on powerpc (and possibly ia64) before it has been read from the device-tree, which leads to large slab consumption with CONFIG_SLUB and memoryless nodes. NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate after start_secondary(), similar to ia64, which is invoked via smp_init(). Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as early_initcall()") made init_workqueues() be invoked via do_pre_smp_initcalls(), which is obviously before the secondary processors are online. Additionally, the following commits changed init_workqueues() to use cpu_to_node to determine the node to use for kthread_create_on_node: bce903809ab3f ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]") f3f90ad469342 ("workqueue: determine NUMA node of workers accourding to the allowed cpumask") Therefore, when init_workqueues() runs, it sees all CPUs as being on Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to a high number of slab deactivations (http://www.spinics.net/lists/linux-mm/msg67489.html). Fix this by initializing the powerpc-specific CPU<->node/local memory node mapping as early as possible, which on powerpc is do_init_bootmem(). Currently that function initializes the mapping for the boot CPU, but we extend it to setup the mapping for all possible CPUs. Then, in smp_prepare_cpus(), we can correspondingly set the per-cpu values for all possible CPUs. That ensures that before the early_initcalls run (and really as early as possible), the per-cpu NUMA mapping is accurate. While testing memoryless nodes on PowerKVM guests with a fix to the workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a guest topology of: available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 node 0 size: 0 MB node 0 free: 0 MB node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 node 1 size: 16336 MB node 1 free: 15329 MB node distances: node 0 1 0: 10 40 1: 40 10 the slab consumption decreases from Slab: 932416 kB SUnreclaim: 902336 kB to Slab: 395264 kB SUnreclaim: 359424 kB And we a corresponding increase in the slab efficiency from slab mem objs slabs used active active ------------------------------------------------------------ kmalloc-16384 337 MB 11.28% 100.00% task_struct 288 MB 9.93% 100.00% to slab mem objs slabs used active active ------------------------------------------------------------ kmalloc-16384 37 MB 100.00% 100.00% task_struct 31 MB 100.00% 100.00% Powerpc didn't support memoryless nodes until recently (64bb80d87f01 "powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261194d "powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also helped improve memory consumption with these kind of environments. Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-08-05powerpc/mm/numa: Fix break placementAndrey Utkin1-1/+1
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81631 Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com> CC: <stable@vger.kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-18powerpc/mm: fix ".__node_distance" undefinedMike Qiu1-0/+1
CHK include/config/kernel.release CHK include/generated/uapi/linux/version.h CHK include/generated/utsrelease.h ... Building modules, stage 2. WARNING: 1 bad relocations c0000000013d6a30 R_PPC64_ADDR64 uprobes_fetch_type_table WRAP arch/powerpc/boot/zImage.pseries WRAP arch/powerpc/boot/zImage.epapr MODPOST 1849 modules ERROR: ".__node_distance" [drivers/block/nvme.ko] undefined! make[1]: *** [__modpost] Error 1 make: *** [modules] Error 2 make: *** Waiting for unfinished jobs.... The reason is symbol "__node_distance" not been exported in powerpc. Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Jesse Larrew <jlarrew@linux.vnet.ibm.com> Cc: Robert Jennings <rcj@linux.vnet.ibm.com> Cc: Alistair Popple <alistair@popple.id.au> Cc: Mike Qiu <qiudayu@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-09power, sched: stop updating inside arch_update_cpu_topology() when nothing ↵Michael Wang1-0/+15
to be update Since v1: Edited the comment according to Srivatsa's suggestion. During the testing, we encounter below WARN followed by Oops: WARNING: at kernel/sched/core.c:6218 ... NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200 LR [c000000000101358] .build_sched_domains+0xec8/0x1200 PACATMSCRATCH [800000000000f032] Call Trace: [c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200 [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510 [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0 [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30 ... Oops: Kernel access of bad area, sig: 11 [#1] ... NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0 LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200 PACATMSCRATCH [8000000000029032] Call Trace: [c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0 [c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200 [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510 [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0 [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30 ... This was caused by that 'sd->groups == NULL' after building groups, which was caused by the empty 'sd->span'. The cpu's domain contained nothing because the cpu was assigned to a wrong node, due to the following unfortunate sequence of events: 1. The hypervisor sent a topology update to the guest OS, to notify changes to the cpu-node mapping. However, the update was actually redundant - i.e., the "new" mapping was exactly the same as the old one. 2. Due to this, the 'updated_cpus' mask turned out to be empty after exiting the 'for-loop' in arch_update_cpu_topology(). 3. So we ended up calling stop-machine() with an empty cpumask list, which made stop-machine internally elect cpumask_first(cpu_online_mask), i.e., CPU0 as the cpu to run the payload (the update_cpu_topology() function). 4. This causes update_cpu_topology() to be run by CPU0. And since 'updates' is kzalloc()'ed inside arch_update_cpu_topology(), update_cpu_topology() finds update->cpu as well as update->new_nid to be 0. In other words, we end up assigning CPU0 (and eventually its siblings) to node 0, incorrectly. Along with the following wrong updating, it causes the sched-domain rebuild code to break and crash the system. Fix this by skipping the topology update in cases where we find that the topology has not actually changed in reality (ie., spurious updates). CC: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: Paul Mackerras <paulus@samba.org> CC: Nathan Fontenot <nfont@linux.vnet.ibm.com> CC: Stephen Rothwell <sfr@canb.auug.org.au> CC: Andrew Morton <akpm@linux-foundation.org> CC: Robert Jennings <rcj@linux.vnet.ibm.com> CC: Jesse Larrew <jlarrew@linux.vnet.ibm.com> CC: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> CC: Alistair Popple <alistair@popple.id.au> Suggested-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-29powerpc/numa: Fix decimal permissionsJoe Perches1-1/+1
This should have been octal. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-27Merge branch 'next' of ↵Linus Torvalds1-5/+91
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc updates from Ben Herrenschmidt: "So here's my next branch for powerpc. A bit late as I was on vacation last week. It's mostly the same stuff that was in next already, I just added two patches today which are the wiring up of lockref for powerpc, which for some reason fell through the cracks last time and is trivial. The highlights are, in addition to a bunch of bug fixes: - Reworked Machine Check handling on kernels running without a hypervisor (or acting as a hypervisor). Provides hooks to handle some errors in real mode such as TLB errors, handle SLB errors, etc... - Support for retrieving memory error information from the service processor on IBM servers running without a hypervisor and routing them to the memory poison infrastructure. - _PAGE_NUMA support on server processors - 32-bit BookE relocatable kernel support - FSL e6500 hardware tablewalk support - A bunch of new/revived board support - FSL e6500 deeper idle states and altivec powerdown support You'll notice a generic mm change here, it has been acked by the relevant authorities and is a pre-req for our _PAGE_NUMA support" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (121 commits) powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked() powerpc: Add support for the optimised lockref implementation powerpc/powernv: Call OPAL sync before kexec'ing powerpc/eeh: Escalate error on non-existing PE powerpc/eeh: Handle multiple EEH errors powerpc: Fix transactional FP/VMX/VSX unavailable handlers powerpc: Don't corrupt transactional state when using FP/VMX in kernel powerpc: Reclaim two unused thread_info flag bits powerpc: Fix races with irq_work Move precessing of MCE queued event out from syscall exit path. pseries/cpuidle: Remove redundant call to ppc64_runlatch_off() in cpu idle routines powerpc: Make add_system_ram_resources() __init powerpc: add SATA_MV to ppc64_defconfig powerpc/powernv: Increase candidate fw image size powerpc: Add debug checks to catch invalid cpu-to-node mappings powerpc: Fix the setup of CPU-to-Node mappings during CPU online powerpc/iommu: Don't detach device without IOMMU group powerpc/eeh: Hotplug improvement powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config space powerpc/eeh: Add restore_config operation ...
2014-01-21memblock: make memblock_set_node() support different memblock_typeTang Chen1-3/+5
[sfr@canb.auug.org.au: fix powerpc build] Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Chen Tang <imtangchen@gmail.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Liu Jiang <jiang.liu@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Renninger <trenn@suse.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-15powerpc: Add debug checks to catch invalid cpu-to-node mappingsSrivatsa S. Bhat1-2/+24
There have been some weird bugs in the past where the kernel tried to associate threads of the same core to different NUMA nodes, and things went haywire after that point (as expected). But unfortunately, root-causing such issues have been quite challenging, due to the lack of appropriate debug checks in the kernel. These bugs usually lead to some odd soft-lockups in the scheduler's build-sched-domain code in the CPU hotplug path, which makes it very hard to trace it back to the incorrect cpu-to-node mappings. So add appropriate debug checks to catch such invalid cpu-to-node mappings as early as possible. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-15powerpc: Fix the setup of CPU-to-Node mappings during CPU onlineSrivatsa S. Bhat1-3/+67
On POWER platforms, the hypervisor can notify the guest kernel about dynamic changes in the cpu-numa associativity (VPHN topology update). Hence the cpu-to-node mappings that we got from the firmware during boot, may no longer be valid after such updates. This is handled using the arch_update_cpu_topology() hook in the scheduler, and the sched-domains are rebuilt according to the new mappings. But unfortunately, at the moment, CPU hotplug ignores these updated mappings and instead queries the firmware for the cpu-to-numa relationships and uses them during CPU online. So the kernel can end up assigning wrong NUMA nodes to CPUs during subsequent CPU hotplug online operations (after booting). Further, a particularly problematic scenario can result from this bug: On POWER platforms, the SMT mode can be switched between 1, 2, 4 (and even 8) threads per core. The switch to Single-Threaded (ST) mode is performed by offlining all except the first CPU thread in each core. Switching back to SMT mode involves onlining those other threads back, in each core. Now consider this scenario: 1. During boot, the kernel gets the cpu-to-node mappings from the firmware and assigns the CPUs to NUMA nodes appropriately, during CPU online. 2. Later on, the hypervisor updates the cpu-to-node mappings dynamically and communicates this update to the kernel. The kernel in turn updates its cpu-to-node associations and rebuilds its sched domains. Everything is fine so far. 3. Now, the user switches the machine from SMT to ST mode (say, by running ppc64_cpu --smt=1). This involves offlining all except 1 thread in each core. 4. The user then tries to switch back from ST to SMT mode (say, by running ppc64_cpu --smt=4), and this involves onlining those threads back. Since CPU hotplug ignores the new mappings, it queries the firmware and tries to associate the newly onlined sibling threads to the old NUMA nodes. This results in sibling threads within the same core getting associated with different NUMA nodes, which is incorrect. The scheduler's build-sched-domains code gets thoroughly confused with this and enters an infinite loop and causes soft-lockups, as explained in detail in commit 3be7db6ab (powerpc: VPHN topology change updates all siblings). So to fix this, use the numa_cpu_lookup_table to remember the updated cpu-to-node mappings, and use them during CPU hotplug online operations. Further, we also need to ensure that all threads in a core are assigned to a common NUMA node, irrespective of whether all those threads were online during the topology update. To achieve this, we take care not to use cpu_sibling_mask() since it is not hotplug invariant. Instead, we use cpu_first_sibling_thread() and set up the mappings manually using the 'threads_per_core' value for that particular platform. This helps us ensure that we don't hit this bug with any combination of CPU hotplug and SMT mode switching. Cc: stable@vger.kernel.org Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-11-13mm: use pgdat_end_pfn() to simplify the code in archXishi Qiu1-2/+1
Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn + pgdat->node_spanned_pages". Simplify the code, no functional change. Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-30powerpc: Fix warnings for arch/powerpc/mm/numa.cRobert Jennings1-4/+4
Simple fixes for sparse warnings in this file. Resolves: arch/powerpc/mm/numa.c:198:24: warning: Using plain integer as NULL pointer arch/powerpc/mm/numa.c:1157:5: warning: symbol 'hot_add_node_scn_to_nid' was not declared. Should it be static? arch/powerpc/mm/numa.c:1238:28: warning: Using plain integer as NULL pointer arch/powerpc/mm/numa.c:1538:6: warning: symbol 'topology_schedule_update' was not declared. Should it be static? Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-14powerpc: Make NUMA device node code endian safeAlistair Popple1-48/+52
The device tree is big endian so make sure we byteswap on little endian. We assume any pHyp calls also return big endian results in memory. Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-14powerpc: Stop using non-architected shared_proc field in lppacaAnton Blanchard1-1/+1
Although the shared_proc field in the lppaca works today, it is not architected. A shared processor partition will always have a non zero yield_count so use that instead. Create a wrapper so users don't have to know about the details. In order for older kernels to continue to work on KVM we need to set the shared_proc bit. While here, remove the ugly bitfield. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-01powerpc: VPHN topology change updates all siblingsRobert Jennings1-15/+44
When an associativity level change is found for one thread, the siblings threads need to be updated as well. This is done today for PRRN in stage_topology_update() but is missing for VPHN in update_cpu_associativity_changes_mask(). This patch will correctly update all thread siblings during a topology change. Without this patch a topology update can result in a CPU in init_sched_groups_power() getting stuck indefinitely in a loop. This loop is built in build_sched_groups(). As a result of the thread moving to a node separate from its siblings the struct sched_group will have its next pointer set to point to itself rather than the sched_group struct of the next thread. This happens because we have a domain without the SD_OVERLAP flag, which is correct, and a topology that doesn't conform with reality (threads on the same core assigned to different numa nodes). When this list is traversed by init_sched_groups_power() it will reach the thread's sched_group structure and loop indefinitely; the cpu will be stuck at this point. The bug was exposed when VPHN was enabled in commit b7abef0 (v3.9). Cc: <stable@vger.kernel.org> [v3.9+] Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc/numa: Do not update sysfs cpu registration from invalid contextNathan Fontenot1-2/+3
The topology update code that updates the cpu node registration in sysfs should not be called while in stop_machine(). The register/unregister calls take a lock and may sleep. This patch moves these calls outside of the call to stop_machine(). Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> CC: <stable@vger.kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc: Delete __cpuinit usage from all usersPaul Gortmaker1-4/+3
The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes. After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h. This removes all the powerpc uses of the __cpuinit macros. There are no __CPUINIT users in assembly files in powerpc. [1] https://lkml.org/lkml/2013/5/20/589 Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Josh Boyer <jwboyer@gmail.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-02Merge branch 'next' of ↵Linus Torvalds1-62/+216
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc update from Benjamin Herrenschmidt: "The main highlights this time around are: - A pile of addition POWER8 bits and nits, such as updated performance counter support (Michael Ellerman), new branch history buffer support (Anshuman Khandual), base support for the new PCI host bridge when not using the hypervisor (Gavin Shan) and other random related bits and fixes from various contributors. - Some rework of our page table format by Aneesh Kumar which fixes a thing or two and paves the way for THP support. THP itself will not make it this time around however. - More Freescale updates, including Altivec support on the new e6500 cores, new PCI controller support, and a pile of new boards support and updates. - The usual batch of trivial cleanups & fixes" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits) powerpc: Fix build error for book3e powerpc: Context switch the new EBB SPRs powerpc: Turn on the EBB H/FSCR bits powerpc: Replace CPU_FTR_BCTAR with CPU_FTR_ARCH_207S powerpc: Setup BHRB instructions facility in HFSCR for POWER8 powerpc: Fix interrupt range check on debug exception powerpc: Update tlbie/tlbiel as per ISA doc powerpc: Print page size info during boot powerpc: print both base and actual page size on hash failure powerpc: Fix hpte_decode to use the correct decoding for page sizes powerpc: Decode the pte-lp-encoding bits correctly. powerpc: Use encode avpn where we need only avpn values powerpc: Reduce PTE table memory wastage powerpc: Move the pte free routines from common header powerpc: Reduce the PTE_INDEX_SIZE powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format powerpc: New hugepage directory format powerpc: Don't truncate pgd_index wrongly powerpc: Don't hard code the size of pte page powerpc: Save DAR and DSISR in pt_regs on MCE ...
2013-04-30powerpc/pseries: Correct builds break when CONFIG_SMP not definedNathan Fontenot1-0/+8
Correct build failure for powerpc/pseries builds with CONFIG_SMP not defined. The function cpu_sibling_mask has no meaning (or definition) when CONFIG_SMP is not defined. Additionally, the updating of NUMA affinity for a CPU in a UP system doesn't really make sense. This patch ifdef's out the code making the affinity updates for PRRN events to fix the following build break. arch/powerpc/mm/numa.c: In function ‘stage_topology_update’: arch/powerpc/mm/numa.c:1535: error: implicit declaration of function ‘cpu_sibling_mask’ arch/powerpc/mm/numa.c:1535: warning: passing argument 3 of ‘cpumask_or’ makes pointer from integer without a cast make[1]: *** [arch/powerpc/mm/numa.o] Error 1 Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Fix build failure after merge of the cgroup treeStephen Rothwell1-0/+1
After merging the cgroup tree, today's linux-next build (powerpc ppc64_defconfig) failed like this: arch/powerpc/mm/numa.c: In function 'arch_update_cpu_topology': arch/powerpc/mm/numa.c:1465:2: error: implicit declaration of function 'kzalloc' [-Werror=implicit-function-declaration] arch/powerpc/mm/numa.c:1465:10: error: assignment makes pointer from integer without a cast [-Werror] arch/powerpc/mm/numa.c:1497:2: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration] Caused by commit 30c05350c39d ("powerpc/pseries: Use stop machine to update cpu maps") from the powerpc tree interacting with (probably) commit ff794dea52ea ("cpuset: remove include of cgroup.h from cpuset.h") from the cgroup tree. Removing includes from header files is fraught with danger ... The former should have added an include of linux/slab.h to arch/powerpc/mm/numa.c. I have added the following merge fix patch for today (but it should be applied to the powerpc tree ASAP). From: Stephen Rothwell <sfr@canb.auug.org.au> Date: Mon, 29 Apr 2013 14:01:44 +1000 Subject: [PATCH] powerpc: numa.c: using kzalloc/kfree requires including slab.h fixes these build errors: arch/powerpc/mm/numa.c: In function 'arch_update_cpu_topology': arch/powerpc/mm/numa.c:1465:2: error: implicit declaration of function 'kzalloc' [-Werror=implicit-function-declaration] arch/powerpc/mm/numa.c:1465:10: error: assignment makes pointer from integer without a cast [-Werror] arch/powerpc/mm/numa.c:1497:2: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration] Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-29Merge branch 'for-3.10' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Fixes and a lot of cleanups. Locking cleanup is finally complete. cgroup_mutex is no longer exposed to individual controlelrs which used to cause nasty deadlock issues. Li fixed and cleaned up quite a bit including long standing ones like racy cgroup_path(). - device cgroup now supports proper hierarchy thanks to Aristeu. - perf_event cgroup now supports proper hierarchy. - A new mount option "__DEVEL__sane_behavior" is added. As indicated by the name, this option is to be used for development only at this point and generates a warning message when used. Unfortunately, cgroup interface currently has too many brekages and inconsistencies to implement a consistent and unified hierarchy on top. The new flag is used to collect the behavior changes which are necessary to implement consistent unified hierarchy. It's likely that this flag won't be used verbatim when it becomes ready but will be enabled implicitly along with unified hierarchy. The option currently disables some of broken behaviors in cgroup core and also .use_hierarchy switch in memcg (will be routed through -mm), which can be used to make very unusual hierarchy where nesting is partially honored. It will also be used to implement hierarchy support for blk-throttle which would be impossible otherwise without introducing a full separate set of control knobs. This is essentially versioning of interface which isn't very nice but at this point I can't see any other options which would allow keeping the interface the same while moving towards hierarchy behavior which is at least somewhat sane. The planned unified hierarchy is likely to require some level of adaptation from userland anyway, so I think it'd be best to take the chance and update the interface such that it's supportable in the long term. Maintaining the existing interface does complicate cgroup core but shouldn't put too much strain on individual controllers and I think it'd be manageable for the foreseeable future. Maybe we'll be able to drop it in a decade. Fix up conflicts (including a semantic one adding a new #include to ppc that was uncovered by header the file changes) as per Tejun. * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits) cpuset: fix compile warning when CONFIG_SMP=n cpuset: fix cpu hotplug vs rebuild_sched_domains() race cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn() cgroup: restore the call to eventfd->poll() cgroup: fix use-after-free when umounting cgroupfs cgroup: fix broken file xattrs devcg: remove parent_cgroup. memcg: force use_hierarchy if sane_behavior cgroup: remove cgrp->top_cgroup cgroup: introduce sane_behavior mount option move cgroupfs_root to include/linux/cgroup.h cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix cgroup: make cgroup_path() not print double slashes Revert "cgroup: remove bind() method from cgroup_subsys." perf: make perf_event cgroup hierarchical cgroup: implement cgroup_is_descendant() cgroup: make sure parent won't be destroyed before its children cgroup: remove bind() method from cgroup_subsys. devcg: remove broken_hierarchy tag cgroup: remove cgroup_lock_is_held() ...
2013-04-29powerpc/mm/numa: use setup_nr_node_ids() instead of opencoding.Cody P Schafer1-6/+3
[sfr@canb.auug.org.au: add missing semicolon] Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-26powerpc/pseries: Add /proc interface to control topology updatesNathan Fontenot1-1/+61
There are instances in which we do not want topology updates to occur. In order to allow this a /proc interface (/proc/powerpc/topology_updates) is introduced so that topology updates can be enabled and disabled. This patch also adds a prrn_is_enabled() call so that PRRN events are handled in the kernel only if topology updating is enabled. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26powerpc/pseries: RE-enable Virtual Processor Home Node updatingJesse Larrew1-2/+1
The new PRRN firmware feature provides a more convenient and event-driven interface than VPHN for notifying Linux of changes to the NUMA affinity of platform resources. However, for practical reasons, it may not be feasible for some customers to update to the latest firmware. For these customers, the VPHN feature supported on previous firmware versions may still be the best option. The VPHN feature was previously disabled due to races with the load balancing code when accessing the NUMA cpu maps, but the new stop_machine() approach protects the NUMA cpu maps from these concurrent accesses. It should be safe to re-enable this feature now. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26powerpc/pseries: Update NUMA VDSO information when updating CPU mapsJesse Larrew1-1/+7
The following patch adds vdso_getcpu_init(), which stores the NUMA node for a cpu in SPRG3: Commit 18ad51dd34 ("powerpc: Add VDSO version of getcpu") adds vdso_getcpu_init(), which stores the NUMA node for a cpu in SPRG3. This patch ensures that this information is also updated when the NUMA affinity of a cpu changes. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26powerpc/pseries: Use stop machine to update cpu mapsNathan Fontenot1-19/+65
The new PRRN firmware feature allows CPU and memory resources to be transparently reassigned across NUMA boundaries. When this happens, the kernel must update the node maps to reflect the new affinity information. Although the NUMA maps can be protected by locking primitives during the update itself, this is insufficient to prevent concurrent accesses to these structures. Since cpumask_of_node() hands out a pointer to these structures, they can still be modified outside of the lock. Furthermore, tracking down each usage of these pointers and adding locks would be quite invasive and difficult to maintain. The approach used is to make a list of affected cpus and call stop_machine to have the update routine run on each of the affected cpus allowing them to update themselves. Each cpu finds itself in the list of cpus and makes the appropriate updates. We need to have each cpu do this for themselves to handle calls to vdso_getcpu_init() added in a subsequent patch. Situations like these are best handled using stop_machine(). Since the NUMA affinity updates are exceptionally rare events, this approach has the benefit of not adding any overhead while accessing the NUMA maps during normal operation. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26powerpc/pseries: Update CPU maps when device tree is updatedJesse Larrew1-24/+75
Platform events such as partition migration or the new PRRN firmware feature can cause the NUMA characteristics of a CPU to change, and these changes will be reflected in the device tree nodes for the affected CPUs. This patch registers a handler for Open Firmware device tree updates and reconfigures the CPU and node maps whenever the associativity changes. Currently, this is accomplished by marking the affected CPUs in the cpu_associativity_changes_mask and allowing arch_update_cpu_topology() to retrieve the new associativity information using hcall_vphn(). Protecting the NUMA cpu maps from concurrent access during an update operation will be addressed in a subsequent patch in this series. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26powerpc/pseries: Update numa.c to use updated firmware_has_feature()Nathan Fontenot1-19/+3
Update the numa code to use the updated firmware_has_feature() when checking for type 1 affinity. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-18powerpc: fix annotation of fake_numa_create_new_node()Stephen Rothwell1-1/+1
This function has always been marked as __cpuinit, but is only called from functions marked as __init and references an __initdata variable. So change its annotation to __init. Fixes this build warning: WARNING: arch/powerpc/mm/built-in.o(.cpuinit.text+0x86): Section mismatch in reference from the function .fake_numa_create_new_node() to the variable .init.data:cmdline The function __cpuinit .fake_numa_create_new_node() references a variable __initdata cmdline. If cmdline is only used by .fake_numa_create_new_node then annotate cmdline with a matching annotation. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
2013-04-18powerpc: fix numa distance for form0 device treeVaidyanathan Srinivasan1-1/+1
The following commit breaks numa distance setup for old powerpc systems that use form0 encoding in device tree. commit 41eab6f88f24124df89e38067b3766b7bef06ddb powerpc/numa: Use form 1 affinity to setup node distance Device tree node /rtas/ibm,associativity-reference-points would index into /cpus/PowerPCxxxx/ibm,associativity based on form0 or form1 encoding detected by ibm,architecture-vec-5 property. All modern systems use form1 and current kernel code is correct. However, on older systems with form0 encoding, the numa distance will get hard coded as LOCAL_DISTANCE for all nodes. This causes task scheduling anomaly since scheduler will skip building numa level domain (topmost domain with all cpus) if all numa distances are same. (value of 'level' in sched_init_numa() will remain 0) Prior to the above commit: ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE) Restoring compatible behavior with this patch for old powerpc systems with device tree where numa distance are encoded as form0. Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
2012-11-15powerpc+of: Move of_drconf_cell struct definition to asm/prom.hNathan Fontenot1-12/+0
This patch moves the definition of the of_drconf_cell struct to asm/prom.h to make it available for all powerpc/pseries code. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Acked-by: Rob Herring <rob.herring@calxeda.com> Acked-by: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-05powerpc/vphn: Fix arch_update_cpu_topology() return valueJesse Larrew1-3/+4
arch_update_cpu_topology() should only return 1 when the topology has actually changed, and should return 0 otherwise. This patch fixes a potential bug where rebuild_sched_domains() would reinitialize the sched domains even when the topology hasn't changed. Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-10Merge branch 'merge' into nextBenjamin Herrenschmidt1-1/+1
We want to bring in the latest IRQ fixes
2012-07-10powerpc/numa: Avoid stupid uninitialized warning from gccBenjamin Herrenschmidt1-1/+1
Newer gcc are being a bit blind here (it's pretty obvious we don't reach the code path using the array if we haven't initialized the pointer) but none of that is performance critical so let's just silence it. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc/numa: Fix OF node refcounting bugGavin Shan1-0/+2
The form affinity for NUMA is set to 1 if the firmware supports OPAL. Otherwise, we have to retrieve that from OF node "/chosen". For the latter case, OF node "/chosen" reference count was never decreased. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-06-29powerpc: Fix uninitialised error in numa.cMichael Neuling1-1/+1
chroma_defconfig currently gives me this with gcc 4.6: arch/powerpc/mm/numa.c:638:13: error: 'dm' may be used uninitialized in this function [-Werror=uninitialized] It's a bogus warning/error since of_get_drconf_memory() only writes it anyway. Signed-off-by: Michael Neuling <mikey@neuling.org> cc: <stable@kernel.org> [v3.3+] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-28Disintegrate asm/system.h for PowerPCDavid Howells1-1/+1
Disintegrate asm/system.h for PowerPC. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> cc: linuxppc-dev@lists.ozlabs.org
2012-01-12cpumask: update setup_node_to_cpumask_map() commentsWanlong Gao1-1/+1
node_to_cpumask() has been replaced by cpumask_of_node(), and wholly removed since commit 29c337a0 ("cpumask: remove obsolete node_to_cpumask now everyone uses cpumask_of_node"). So update the comments for setup_node_to_cpumask_map(). Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-08Merge branch 'for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (53 commits) Kconfig: acpi: Fix typo in comment. misc latin1 to utf8 conversions devres: Fix a typo in devm_kfree comment btrfs: free-space-cache.c: remove extra semicolon. fat: Spelling s/obsolate/obsolete/g SCSI, pmcraid: Fix spelling error in a pmcraid_err() call tools/power turbostat: update fields in manpage mac80211: drop spelling fix types.h: fix comment spelling for 'architectures' typo fixes: aera -> area, exntension -> extension devices.txt: Fix typo of 'VMware'. sis900: Fix enum typo 'sis900_rx_bufer_status' decompress_bunzip2: remove invalid vi modeline treewide: Fix comment and string typo 'bufer' hyper-v: Update MAINTAINERS treewide: Fix typos in various parts of the kernel, and fix some comments. clockevents: drop unknown Kconfig symbol GENERIC_CLOCKEVENTS_MIGR gpio: Kconfig: drop unknown symbol 'CS5535_GPIO' leds: Kconfig: Fix typo 'D2NET_V2' sound: Kconfig: drop unknown symbol ARCH_CLPS7500 ... Fix up trivial conflicts in arch/powerpc/platforms/40x/Kconfig (some new kconfig additions, close to removed commented-out old ones)
2012-01-07Merge branch 'driver-core-next' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core * 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (73 commits) arm: fix up some samsung merge sysdev conversion problems firmware: Fix an oops on reading fw_priv->fw in sysfs loading file Drivers:hv: Fix a bug in vmbus_driver_unregister() driver core: remove __must_check from device_create_file debugfs: add missing #ifdef HAS_IOMEM arm: time.h: remove device.h #include driver-core: remove sysdev.h usage. clockevents: remove sysdev.h arm: convert sysdev_class to a regular subsystem arm: leds: convert sysdev_class to a regular subsystem kobject: remove kset_find_obj_hinted() m86k: gpio - convert sysdev_class to a regular subsystem mips: txx9_sram - convert sysdev_class to a regular subsystem mips: 7segled - convert sysdev_class to a regular subsystem sh: dma - convert sysdev_class to a regular subsystem sh: intc - convert sysdev_class to a regular subsystem power: suspend - convert sysdev_class to a regular subsystem power: qe_ic - convert sysdev_class to a regular subsystem power: cmm - convert sysdev_class to a regular subsystem s390: time - convert sysdev_class to a regular subsystem ... Fix up conflicts with 'struct sysdev' removal from various platform drivers that got changed: - arch/arm/mach-exynos/cpu.c - arch/arm/mach-exynos/irq-eint.c - arch/arm/mach-s3c64xx/common.c - arch/arm/mach-s3c64xx/cpu.c - arch/arm/mach-s5p64x0/cpu.c - arch/arm/mach-s5pv210/common.c - arch/arm/plat-samsung/include/plat/cpu.h - arch/powerpc/kernel/sysfs.c and fix up cpu_is_hotpluggable() as per Greg in include/linux/cpu.h
2012-01-06Merge branch 'next' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (185 commits) powerpc: fix compile error with 85xx/p1010rdb.c powerpc: fix compile error with 85xx/p1023_rds.c powerpc/fsl: add MSI support for the Freescale hypervisor arch/powerpc/sysdev/fsl_rmu.c: introduce missing kfree powerpc/fsl: Add support for Integrated Flash Controller powerpc/fsl: update compatiable on fsl 16550 uart nodes powerpc/85xx: fix PCI and localbus properties in p1022ds.dts powerpc/85xx: re-enable ePAPR byte channel driver in corenet32_smp_defconfig powerpc/fsl: Update defconfigs to enable some standard FSL HW features powerpc: Add TBI PHY node to first MDIO bus sbc834x: put full compat string in board match check powerpc/fsl-pci: Allow 64-bit PCIe devices to DMA to any memory address powerpc: Fix unpaired probe_hcall_entry and probe_hcall_exit offb: Fix setting of the pseudo-palette for >8bpp offb: Add palette hack for qemu "standard vga" framebuffer offb: Fix bug in calculating requested vram size powerpc/boot: Change the WARN to INFO for boot wrapper overlap message powerpc/44x: Fix build error on currituck platform powerpc/boot: Change the load address for the wrapper to fit the kernel powerpc/44x: Enable CRASH_DUMP for 440x ... Fix up a trivial conflict in arch/powerpc/include/asm/cputime.h due to the additional sparse-checking code for cputime_t.
2012-01-06Merge branch 'driver-core-next' into Linux 3.2Greg Kroah-Hartman1-4/+4
This resolves the conflict in the arch/arm/mach-s3c64xx/s3c6400.c file, and it fixes the build error in the arch/x86/kernel/microcode_core.c file, that the merge did not catch. The microcode_core.c patch was provided by Stephen Rothwell <sfr@canb.auug.org.au> who was invaluable in the merge issues involved with the large sysdev removal process in the driver-core tree. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-21cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystemKay Sievers1-4/+4
This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem and converts the devices to regular devices. The sysdev drivers are implemented as subsystem interfaces now. After all sysdev classes are ported to regular driver core entities, the sysdev implementation will be entirely removed from the kernel. Userspace relies on events and generic sysfs subsystem infrastructure from sysdev devices, which are made available with this conversion. Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Borislav Petkov <bp@amd64.org> Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk> Cc: Len Brown <lenb@kernel.org> Cc: Zhang Rui <rui.zhang@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-19powerpc/mm: Fix section mismatch for read_n_cellsDavid Rientjes1-1/+1
read_n_cells() cannot be marked as .devinit.text since it is referenced from two functions that are not in that section: of_get_lmb_size() and hot_add_drconf_scn_to_nid(). Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-19powerpc/mm: Fix section mismatch for mark_reserved_regions_for_nidDavid Rientjes1-1/+1
mark_reserved_regions_for_nid() is only called from do_init_bootmem(), which is in .init.text, so it must be in the same section to avoid a section mismatch warning. Reported-by: Subrata Modak <subrata@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-08powerpc: Use HAVE_MEMBLOCK_NODE_MAPTejun Heo1-6/+4
powerpc doesn't access early_node_map[] directly and enabling HAVE_MEMBLOCK_NODE_MAP is trivial - replacing add_active_range() calls with memblock_set_node() and selecting HAVE_MEMBLOCK_NODE_MAP is enough. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Yinghai Lu <yinghai@kernel.org>
2011-12-02treewide: Fix typos in various parts of the kernel, and fix some comments.Justin P. Mattock1-1/+1
The below patch fixes some typos in various parts of the kernel, as well as fixes some comments. Please let me know if I missed anything, and I will try to get it changed and resent. Signed-off-by: Justin P. Mattock <justinmattock@gmail.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-11-28Merge branch 'master' into x86/memblockTejun Heo1-17/+29
Conflicts & resolutions: * arch/x86/xen/setup.c dc91c728fd "xen: allow extra memory to be in multiple regions" 24aa07882b "memblock, x86: Replace memblock_x86_reserve/free..." conflicted on xen_add_extra_mem() updates. The resolution is trivial as the latter just want to replace memblock_x86_reserve_range() with memblock_reserve(). * drivers/pci/intel-iommu.c 166e9278a3f "x86/ia64: intel-iommu: move to drivers/iommu/" 5dfe8660a3d "bootmem: Replace work_with_active_regions() with..." conflicted as the former moved the file under drivers/iommu/. Resolved by applying the chnages from the latter on the moved file. * mm/Kconfig 6661672053a "memblock: add NO_BOOTMEM config symbol" c378ddd53f9 "memblock, x86: Make ARCH_DISCARD_MEMBLOCK a config option" conflicted trivially. Both added config options. Just letting both add their own options resolves the conflict. * mm/memblock.c d1f0ece6cdc "mm/memblock.c: small function definition fixes" ed7b56a799c "memblock: Remove memblock_memory_can_coalesce()" confliected. The former updates function removed by the latter. Resolution is trivial. Signed-off-by: Tejun Heo <tj@kernel.org>
2011-11-08powerpc/numa: NUMA topology support for PowerNVDipankar Sarma1-7/+17
This patch adds support for numa topology on powernv platforms running OPAL formware. It checks for the type of platform at run time and sets the affinity form correctly so that NUMA topology can be discovered correctly. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-06Merge branch 'modsplit-Oct31_2011' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h
2011-10-31powerpc: various straight conversions from module.h --> export.hPaul Gortmaker1-1/+1
All these files were including module.h just for the basic EXPORT_SYMBOL infrastructure. We can shift them off to the export.h header which is a way smaller footprint and thus realize some compile time gains. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-09-20powerpc: Coding style cleanupsAnton Blanchard1-3/+4
While converting code to use for_each_node_by_type I noticed a number of coding style issues. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-09-20powerpc: Use for_each_node_by_type instead of open coding itAnton Blanchard1-5/+5
Use for_each_node_by_type instead of open coding it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-09-20powerpc/numa: Remove double of_node_put in hot_add_node_scn_to_nidAnton Blanchard1-1/+2
During memory hotplug testing, I got the following warning: ERROR: Bad of_node_put() on /memory@0 of_node_release kref_put of_node_put of_find_node_by_type hot_add_node_scn_to_nid hot_add_scn_to_nid memory_add_physaddr_to_nid ... of_find_node_by_type() loop does the of_node_put for us so we only need the handle the case where we terminate the loop early. As suggested by Stephen Rothwell we can do the of_node_put unconditionally outside of the loop since of_node_put handles a NULL argument fine. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: stable@kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-07-14bootmem: Replace work_with_active_regions() with for_each_mem_pfn_range()Tejun Heo1-35/+15
Callback based iteration is cumbersome and much less useful than for_each_*() iterator. This patch implements for_each_mem_pfn_range() which replaces work_with_active_regions(). All the current users of work_with_active_regions() are converted. This simplifies walking over early_node_map and will allow converting internal logics in page_alloc to use iterator instead of walking early_node_map directly, which in turn will enable moving node information to memblock. powerpc change is only compile tested. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20110714074610.GD3455@htj.dyndns.org Cc: Yinghai Lu <yinghai@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-05-04powerpc: Convert old cpumask API into new oneKOSAKI Motohiro1-1/+1
Adapt new API. Almost change is trivial. Most important change is the below line because we plan to change task->cpus_allowed implementation. - ctx->cpus_allowed = current->cpus_allowed; Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-04-27powerpc/numa: Look for ibm, associativity-reference-points at the rootMichael Ellerman1-8/+7
If we don't find ibm,associativity-reference-points as a child of /rtas, look for it at the root of the tree instead. We use this on Book3E where we have no RTAS but still use the sPAPR conventions for NUMA. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-03-31Fix common misspellingsLucas De Marchi1-5/+5
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-10powerpc/pseries: Disable VPNH featureBenjamin Herrenschmidt1-1/+2
This feature triggers nasty races in the scheduler between the rebuilding of the topology and the load balancing code, causing the machine to hang. Disable it for now until the races are fixed. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-02-07powerpc/numa: Fix bug in unmap_cpu_from_nodeAnton Blanchard1-1/+1
When converting to the new cpumask code I screwed up: - if (cpu_isset(cpu, numa_cpumask_lookup_table[node])) { - cpu_clear(cpu, numa_cpumask_lookup_table[node]); + if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) { + cpumask_set_cpu(cpu, node_to_cpumask_map[node]); This was introduced in commit 25863de07af9 (powerpc/cpumask: Convert NUMA code to new cpumask API) Fix it. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: <stable@kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-02-07powerpc/numa: Disable VPHN on dedicated processor partitionsAnton Blanchard1-1/+2
There is no need to start up the timer and monitor topology changes on a dedicated processor partition, so disable it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-02-07powerpc/numa: Add length when creating OF properties via VPHNAnton Blanchard1-3/+9
The rest of the NUMA code expects an OF associativity property with the first cell containing the length. Without this fix all topology changes cause us to misparse the property and put the cpu into node 0. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-02-07powerpc/numa: Check for all VPHN changesAnton Blanchard1-1/+1
The hypervisor uses unsigned 1 byte counters to signal topology changes to the OS. Since they can wrap we need to check for any difference, not just if the hypervisor count is greater than the previous count. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-02-07powerpc/numa: Only use active VPHN count fieldsAnton Blanchard1-4/+6
VPHN supports up to 8 distance fields but the number of entries in ibm,associativity-reference-points signifies how many are in use. Don't look at all the VPHN counts, only distance_ref_points_depth worth. Since we already cap our distance metrics at MAX_DISTANCE_REF_POINTS, use that to size the VPHN arrays and add a BUILD_BUG_ON to avoid it growing larger than the VPHN maximum of 8. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>