aboutsummaryrefslogtreecommitdiffstats
path: root/arch/powerpc/mm/hash_utils_64.c
AgeCommit message (Collapse)AuthorFilesLines
2019-05-03powerpc/mm: Move book3s64 specifics in subdirectory mm/book3s64Christophe Leroy1-1930/+0
Many files in arch/powerpc/mm are only for book3S64. This patch creates a subdirectory for them. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> [mpe: Update the selftest sym links, shorten new filenames, cleanup some whitespace and formatting in the new files.] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-28powerpc/mm: fix spelling mistake "Outisde" -> "Outside"Colin Ian King1-2/+2
There are several identical spelling mistakes in warning messages, fix these. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc/mm: Validate address values against different region limitsAneesh Kumar K.V1-3/+15
This adds an explicit check in various functions. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc/mm/hash64: Map all the kernel regions in the same 0xc rangeAneesh Kumar K.V1-11/+15
This patch maps vmalloc, IO and vmemap regions in the 0xc address range instead of the current 0xd and 0xf range. This brings the mapping closer to radix translation mode. With hash 64K page size each of this region is 512TB whereas with 4K config we are limited by the max page table range of 64TB and hence there regions are of 16TB size. The kernel mapping is now: On 4K hash kernel_region_map_size = 16TB kernel vmalloc start = 0xc000100000000000 kernel IO start = 0xc000200000000000 kernel vmemmap start = 0xc000300000000000 64K hash, 64K radix and 4k radix: kernel_region_map_size = 512TB kernel vmalloc start = 0xc008000000000000 kernel IO start = 0xc00a000000000000 kernel vmemmap start = 0xc00c000000000000 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc/mm/hash64: Add a variable to track the end of IO mappingAneesh Kumar K.V1-0/+1
This makes it easy to update the region mapping in the later patch Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerc/mm/hash: Reduce hash_mm_context sizeAneesh Kumar K.V1-0/+3
Allocate subpage protect related variables only if we use the feature. This helps in reducing the hash related mm context struct by around 4K Before the patch sizeof(struct hash_mm_context) = 8288 After the patch sizeof(struct hash_mm_context) = 4160 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc/mm: Reduce memory usage for mm_context_t for radixAneesh Kumar K.V1-1/+3
Currently, our mm_context_t on book3s64 include all hash specific context details like slice mask and subpage protection details. We can skip allocating these with radix translation. This will help us to save 8K per mm_context with radix translation. With the patch applied we have sizeof(mm_context_t) = 136 sizeof(struct hash_mm_context) = 8288 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc/mm: Move slb_addr_linit to early_init_mmuAneesh Kumar K.V1-0/+2
Avoid #ifdef in generic code. Also enables us to do this specific to MMU translation mode on book3s64 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21powerpc/mm: Add helpers for accessing hash translation related variablesAneesh Kumar K.V1-5/+5
We want to switch to allocating them runtime only when hash translation is enabled. Add helpers so that both book3s and nohash can be adapted to upcoming change easily. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20powerpc/mm: move warning from resize_hpt_for_hotplug()Laurent Vivier1-12/+7
resize_hpt_for_hotplug() reports a warning when it cannot resize the hash page table ("Unable to resize hash page table to target order") but in some cases it's not a problem and can make user thinks something has not worked properly. This patch moves the warning to arch_remove_memory() to only report the problem when it is needed. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-03-12treewide: add checks for the return value of memblock_alloc*()Mike Rapoport1-0/+3
Add check for the return value of memblock_alloc*() functions and call panic() in case of error. The panic message repeats the one used by panicing memblock allocators with adjustment of parameters to include only relevant ones. The replacement was mostly automated with semantic patches like the one below with manual massaging of format strings. @@ expression ptr, size, align; @@ ptr = memblock_alloc(size, align); + if (!ptr) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align); [anders.roxell@linaro.org: use '%pa' with 'phys_addr_t' type] Link: http://lkml.kernel.org/r/20190131161046.21886-1-anders.roxell@linaro.org [rppt@linux.ibm.com: fix format strings for panics after memblock_alloc] Link: http://lkml.kernel.org/r/1548950940-15145-1-git-send-email-rppt@linux.ibm.com [rppt@linux.ibm.com: don't panic if the allocation in sparse_buffer_init fails] Link: http://lkml.kernel.org/r/20190131074018.GD28876@rapoport-lnx [akpm@linux-foundation.org: fix xtensa printk warning] Link: http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Reviewed-by: Guo Ren <ren_guo@c-sky.com> [c-sky] Acked-by: Paul Burton <paul.burton@mips.com> [MIPS] Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> [s390] Reviewed-by: Juergen Gross <jgross@suse.com> [Xen] Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Acked-by: Max Filippov <jcmvbkbc@gmail.com> [xtensa] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12memblock: drop memblock_alloc_base()Mike Rapoport1-2/+6
The memblock_alloc_base() function tries to allocate a memory up to the limit specified by its max_addr parameter and panics if the allocation fails. Replace its usage with memblock_phys_alloc_range() and make the callers check the return value and panic in case of error. Link: http://lkml.kernel.org/r/1548057848-15136-10-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-3/+3
Merge more updates from Andrew Morton: - some of the rest of MM - various misc things - dynamic-debug updates - checkpatch - some epoll speedups - autofs - rapidio - lib/, lib/lzo/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits) samples/mic/mpssd/mpssd.h: remove duplicate header kernel/fork.c: remove duplicated include include/linux/relay.h: fix percpu annotation in struct rchan arch/nios2/mm/fault.c: remove duplicate include unicore32: stop printing the virtual memory layout MAINTAINERS: fix GTA02 entry and mark as orphan mm: create the new vm_fault_t type arm, s390, unicore32: remove oneliner wrappers for memblock_alloc() arch: simplify several early memory allocations openrisc: simplify pte_alloc_one_kernel() sh: prefer memblock APIs returning virtual address microblaze: prefer memblock API returning virtual address powerpc: prefer memblock APIs returning virtual address lib/lzo: separate lzo-rle from lzo lib/lzo: implement run-length encoding lib/lzo: fast 8-byte copy on arm64 lib/lzo: 64-bit CTZ on arm64 lib/lzo: tidy-up ifdefs ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size ipc: annotate implicit fall through ...
2019-03-07powerpc: prefer memblock APIs returning virtual addressMike Rapoport1-3/+3
Patch series "memblock: simplify several early memory allocation", v4. These patches simplify some of the early memory allocations by replacing usage of older memblock APIs with newer and shinier ones. Quite a few places in the arch/ code allocated memory using a memblock API that returns a physical address of the allocated area, then converted this physical address to a virtual one and then used memset(0) to clear the allocated range. More recent memblock APIs do all the three steps in one call and their usage simplifies the code. It's important to note that regardless of API used, the core allocation is nearly identical for any set of memblock allocators: first it tries to find a free memory with all the constraints specified by the caller and then falls back to the allocation with some or all constraints disabled. The first three patches perform the conversion of call sites that have exact requirements for the node and the possible memory range. The fourth patch is a bit one-off as it simplifies openrisc's implementation of pte_alloc_one_kernel(), and not only the memblock usage. The fifth patch takes care of simpler cases when the allocation can be satisfied with a simple call to memblock_alloc(). The sixth patch removes one-liner wrappers for memblock_alloc on arm and unicore32, as suggested by Christoph. This patch (of 6): There are a several places that allocate memory using memblock APIs that return a physical address, convert the returned address to the virtual address and frequently also memset(0) the allocated range. Update these places to use memblock allocators already returning a virtual address. Use memblock functions that clear the allocated memory instead of calling memset(0) where appropriate. The calls to memblock_alloc_base() that were not followed by memset(0) are replaced with memblock_alloc_try_nid_raw(). Since the latter does not panic() when the allocation fails, the appropriate panic() calls are added to the call sites. Link: http://lkml.kernel.org/r/1546248566-14910-2-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Greentime Hu <green.hu@gmail.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Mark Salter <msalter@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Stafford Horne <shorne@gmail.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Simek <michal.simek@xilinx.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-14powerpc/mm: Fix debugfs_simple_attr.cocci warningsYueHaibing1-3/+3
Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE for debugfs files. Generated by: scripts/coccinelle/api/debugfs/debugfs_simple_attr.cocci Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/64s/hash: Simplify slb_flush_and_rebolt()Nicholas Piggin1-2/+2
slb_flush_and_rebolt() is misleading, it is called in virtual mode, so it can not possibly change the stack, so it should not be touching the shadow area. And since vmalloc is no longer bolted, it should not change any bolted mappings at all. Change the name to slb_flush_and_restore_bolted(), and have it just load the kernel stack from what's currently in the shadow SLB area. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-14powerpc/mm: don't use _PAGE_EXEC for calling hash_preload()Christophe Leroy1-1/+2
The 'access' parameter of hash_preload() is either 0 or _PAGE_EXEC. Among the two versions of hash_preload(), only the PPC64 one is doing something with this 'access' parameter. In order to remove the use of _PAGE_EXEC outside platform code, 'access' parameter is replaced by 'is_exec' which will be either true of false, and the PPC64 version of hash_preload() creates the access flag based on 'is_exec'. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bitAneesh Kumar K.V1-3/+3
With this patch we use 0x8000000000000000UL (_PAGE_PRESENT) to indicate a valid pgd/pud/pmd entry. We also switch the p**_present() to look at this bit. With pmd_present, we have a special case. We need to make sure we consider a pmd marked invalid during THP split as present. Right now we clear the _PAGE_PRESENT bit during a pmdp_invalidate. Inorder to consider this special case we add a new pte bit _PAGE_INVALID (mapped to _RPAGE_SW0). This bit is only used with _PAGE_PRESENT cleared. Hence we are not really losing a pte bit for this special case. pmd_present is also updated to look at _PAGE_INVALID. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03Revert "convert SLB miss handlers to C" and subsequent commitsMichael Ellerman1-11/+35
This reverts commits: 5e46e29e6a97 ("powerpc/64s/hash: convert SLB miss handlers to C") 8fed04d0f6ae ("powerpc/64s/hash: remove user SLB data from the paca") 655deecf67b2 ("powerpc/64s/hash: SLB allocation status bitmaps") 2e1626744e8d ("powerpc/64s/hash: provide arch_setup_exec hooks for hash slice setup") 89ca4e126a3f ("powerpc/64s/hash: Add a SLB preload cache") This series had a few bugs, and the fixes are not all trivial. So revert most of it for now. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-19powerpc/64s/hash: remove user SLB data from the pacaNicholas Piggin1-35/+11
User SLB mappig data is copied into the PACA from the mm->context so it can be accessed by the SLB miss handlers. After the C conversion, SLB miss handlers now run with relocation on, and user SLB misses are able to take recursive kernel SLB misses, so the user SLB mapping data can be removed from the paca and accessed directly. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-30powerpc: remove unnecessary inclusion of asm/tlbflush.hChristophe Leroy1-1/+0
asm/tlbflush.h is only needed for: - using functions xxx_flush_tlb_xxx() - using MMU_NO_CONTEXT - including asm-generic/pgtable.h Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-24powerpc/mm/hash: Remove the superfluous bitwise operation when find hpte groupAneesh Kumar K.V1-6/+4
When computing the starting slot number for a hash page table group we used to do this hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL; Multiplying with 8 (HPTES_PER_GROUP) imply the last three bits are 0. Hence we really don't need to clear then separately. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-16powerpc/64s: Remove POWER9 DD1 supportNicholas Piggin1-30/+0
POWER9 DD1 was never a product. It is no longer supported by upstream firmware, and it is not effectively supported in Linux due to lack of testing. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [mpe: Remove arch_make_huge_pte() entirely] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-06-03Merge branch 'topic/ppc-kvm' into nextMichael Ellerman1-0/+1
Merge in some commits we're sharing with the kvm-ppc tree.
2018-05-24powerpc: Export tm_enable()/tm_disable/tm_abort() APIsSimon Guo1-0/+1
This patch exports tm_enable()/tm_disable/tm_abort() APIs, which will be used for PR KVM transactional memory logic. Signed-off-by: Simon Guo <wei.guo.simon@gmail.com> Reviewed-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-15powerpc/mm: Use page fragments for allocation page table at PMD levelAneesh Kumar K.V1-1/+0
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-15powerpc/mm: Implement helpers for pagetable fragment support at PMD levelAneesh Kumar K.V1-0/+2
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-03powerpc/fadump: Do not use hugepages when fadump is activeHari Bathini1-2/+4
FADump capture kernel boots in restricted memory environment preserving the context of previous kernel to save vmcore. Supporting hugepages in such environment makes things unnecessarily complicated, as hugepages need memory set aside for them. This means most of the capture kernel's memory is used in supporting hugepages. In most cases, this results in out-of-memory issues while booting FADump capture kernel. But hugepages are not of much use in capture kernel whose only job is to save vmcore. So, disabling hugepages support, when fadump is active, is a reliable solution for the out of memory issues. Introducing a flag variable to disable HugeTLB support when fadump is active. Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-01powerpc/64s: Remove POWER4 supportNicholas Piggin1-4/+5
POWER4 has been broken since at least the change 49d09bf2a6 ("powerpc/64s: Optimise MSR handling in exception handling"), which requires mtmsrd L=1 support. This was introduced in ISA v2.01, and POWER4 supports ISA v2.00. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-31Merge branch 'topic/paca' into nextMichael Ellerman1-1/+1
Bring in yet another series that touches KVM code, and might need to be merged into the kvm-ppc branch to resolve conflicts. This required some changes in pnv_power9_force_smt4_catch/release() due to the paca array becomming an array of pointers.
2018-03-31powerpc/mm: Add support for handling > 512TB address in SLB missAneesh Kumar K.V1-2/+2
For addresses above 512TB we allocate additional mmu contexts. To make it all easy, addresses above 512TB are handled with IR/DR=1 and with stack frame setup. The mmu_context_t is also updated to track the new extended_ids. To support upto 4PB we need a total 8 contexts. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Minor formatting tweaks and comment wording, switch BUG to WARN in get_ea_context().] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-31powerpc/mm: Pass node id into create_section_mappingNicholas Piggin1-1/+1
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Move __map_kernel_page_nid() inside #ifdef SPARSEMEM_VMEMMAP] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-27powerpc/64: Call H_REGISTER_PROC_TBL when running as a HPT guest on POWER9Paul Mackerras1-0/+6
On POWER9, since commit cc3d2940133d ("powerpc/64: Enable use of radix MMU under hypervisor on POWER9", 2017-01-30), we set both the radix and HPT bits in the client-architecture-support (CAS) vector, which tells the hypervisor that we can do either radix or HPT. According to PAPR, if we use this combination we are promising to do a H_REGISTER_PROC_TBL hcall later on to let the hypervisor know whether we are doing radix or HPT. We currently do this call if we are doing radix but not if we are doing HPT. If the hypervisor is able to support both radix and HPT guests, it would be entitled to defer allocation of the HPT until the H_REGISTER_PROC_TBL call, and to fail any attempts to create HPTEs until the H_REGISTER_PROC_TBL call. Thus we need to do a H_REGISTER_PROC_TBL call when we are doing HPT; otherwise we may crash at boot time. This adds the code to call H_REGISTER_PROC_TBL in this case, before we attempt to create any HPT entries using H_ENTER. Fixes: cc3d2940133d ("powerpc/64: Enable use of radix MMU under hypervisor on POWER9") Cc: stable@vger.kernel.org # v4.11+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-06powerpc/mm/slice: Allow up to 64 low slicesChristophe Leroy1-7/+6
While the implementation of the "slices" address space allows a significant amount of high slices, it limits the number of low slices to 16 due to the use of a single u64 low_slices_psize element in struct mm_context_t On the 8xx, the minimum slice size is the size of the area covered by a single PMD entry, ie 4M in 4K pages mode and 64M in 16K pages mode. This means we could have at least 64 slices. In order to override this limitation, this patch switches the handling of low_slices_psize to char array as done already for high_slices_psize. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-02-13powerpc/mm: Fix crashes with 16G huge pagesAneesh Kumar K.V1-0/+1
To support memory keys, we moved the hash pte slot information to the second half of the page table. This was ok with PTE entries at level 4 (PTE page) and level 3 (PMD). We already allocate larger page table pages at those levels to accomodate extra details. For level 4 we already have the extra space which was used to track 4k hash page table entry details and at level 3 the extra space was allocated to track the THP details. With hugetlbfs PTE, we used this extra space at the PMD level to store the slot details. But we also support hugetlbfs PTE at PUD level for 16GB pages and PUD level page didn't allocate extra space. This resulted in memory corruption. Fix this by allocating extra space at PUD level when HUGETLB is enabled. Fixes: bf9a95f9a648 ("powerpc: Free up four 64K PTE bits in 64K backed HPTE pages") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-22powerpc/pseries: Don't give a warning when HPT resizing isn't availableDavid Gibson1-1/+1
As of 438cc81a41 "powerpc/pseries: Automatically resize HPT for memory hot add/remove" when running on the pseries platform, we always attempt to use the PAPR extension to resize the hashed page table (HPT) when we add or remove memory. This is fine, but when the extension is available we'll give a harmless, but scary warning. This patch suppresses the warning in this case. It will still warn if the feature is supposed to be available, but didn't work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-21powerpc/hash: Skip non initialized page size in init_hpte_page_sizesAneesh Kumar K.V1-1/+1
One of the easiest way to test config with 4K HPTE is to disable 64K hardware page size like below. int __init htab_dt_scan_page_sizes(unsigned long node, size -= 3; prop += 3; base_idx = get_idx_from_shift(base_shift); - if (base_idx < 0) { + if (base_idx < 0 || base_idx == MMU_PAGE_64K) { /* skip the pte encoding also */ prop += lpnum * 2; size -= lpnum * 2; But then this results in error in other part of the code such as MPSS parsing where we look at 4K base page size and 64K actual page size support. This patch fix MPSS parsing by ignoring the actual page sizes marked unsupported. In reality this can happen only with a corrupt device tree. But it is good to tighten the error check. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: introduce get_mm_addr_key() helperRam Pai1-0/+24
get_mm_addr_key() helper returns the pkey associated with an address corresponding to a given mm_struct. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: Program HPTE key protection bitsRam Pai1-0/+1
Map the PTE protection key bits to the HPTE key protection bits, while creating HPTE entries. Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20powerpc: initial pkey plumbingRam Pai1-0/+1
Basic plumbing to initialize the pkey system. Nothing is enabled yet. A later patch will enable it once all the infrastructure is in place. Signed-off-by: Ram Pai <linuxram@us.ibm.com> [mpe: Rework copyrights to use SPDX tags] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18powerpc/pseries: lift RTAS limit for hashNicholas Piggin1-3/+5
With the previous patch to switch to 64-bit mode after returning from RTAS and before doing any memory accesses, the RMA limit need not be clamped to 1GB to avoid RTAS bugs. Keep the 1GB limit for older firmware (although this is more of a kernel concern than RTAS), and remove it starting with POWER9. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18powerpc/powernv: Remove real mode access limit for early allocationsNicholas Piggin1-7/+13
This removes the RMA limit on powernv platform, which constrains early allocations such as PACAs and stacks. There are still other restrictions that must be followed, such as bolted SLB limits, but real mode addressing has no constraints. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18powerpc/64s: Improve local TLB flush for boot and MCE on POWER9Nicholas Piggin1-0/+8
There are several cases outside the normal address space management where a CPU's entire local TLB is to be flushed: 1. Booting the kernel, in case something has left stale entries in the TLB (e.g., kexec). 2. Machine check, to clean corrupted TLB entries. One other place where the TLB is flushed, is waking from deep idle states. The flush is a side-effect of calling ->cpu_restore with the intention of re-setting various SPRs. The flush itself is unnecessary because in the first case, the TLB should not acquire new corrupted TLB entries as part of sleep/wake (though they may be lost). This type of TLB flush is coded inflexibly, several times for each CPU type, and they have a number of problems with ISA v3.0B: - The current radix mode of the MMU is not taken into account, it is always done as a hash flushn For IS=2 (LPID-matching flush from host) and IS=3 with HV=0 (guest kernel flush), tlbie(l) is undefined if the R field does not match the current radix mode. - ISA v3.0B hash must flush the partition and process table caches as well. - ISA v3.0B radix must flush partition and process scoped translations, partition and process table caches, and also the page walk cache. So consolidate the flushing code and implement it in C and inline asm under the mm/ directory with the rest of the flush code. Add ISA v3.0B cases for radix and hash, and use the radix flush in radix environment. Provide a way for IS=2 (LPID flush) to specify the radix mode of the partition. Have KVM pass in the radix mode of the guest. Take out the flushes from early cputable/dt_cpu_ftrs detection hooks, and move it later in the boot process after, the MMU registers are set up and before relocation is first turned on. The TLB flush is no longer called when restoring from deep idle states. This was not be done as a separate step because booting secondaries uses the same cpu_restore as idle restore, which needs the TLB flush. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20powerpc: use helper functions to get and set hash slotsRam Pai1-9/+4
replace redundant code in __hash_page_4K() and flush_hash_page() with helper functions pte_get_hash_gslot() and pte_set_hidx() Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20powerpc: Free up four 64K PTE bits in 4K backed HPTE pagesRam Pai1-1/+3
Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6, in the 4K backed HPTE pages.These bits continue to be used for 64K backed HPTE pages in this patch, but will be freed up in the next patch. The bit numbers are big-endian as defined in the ISA3.0 The patch does the following change to the 4k HTPE backed 64K PTE's format. H_PAGE_BUSY moves from bit 3 to bit 9 (B bit in the figure below) V0 which occupied bit 4 is not used anymore. V1 which occupied bit 5 is not used anymore. V2 which occupied bit 6 is not used anymore. V3 which occupied bit 7 is not used anymore. Before the patch, the 4k backed 64k PTE format was as follows 0 1 2 3 4 5 6 7 8 9 10...........................63 : : : : : : : : : : : : v v v v v v v v v v v v ,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-, |x|x|x|B|V0|V1|V2|V3|x| | |x|x|................|x|x|x|x| <- primary pte '_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_' |S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte '_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_' After the patch, the 4k backed 64k PTE format is as follows 0 1 2 3 4 5 6 7 8 9 10...........................63 : : : : : : : : : : : : v v v v v v v v v v v v ,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-, |x|x|x| | | | | |x|B| |x|x|................|.|.|.|.| <- primary pte '_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_' |S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte '_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_' the four bits S,G,I,X (one quadruplet per 4k HPTE) that cache the hash-bucket slot value, is initialized to 1,1,1,1 indicating -- an invalid slot. If a HPTE gets cached in a 1111 slot(i.e 7th slot of secondary hash bucket), it is released immediately. In other words, even though 1111 is a valid slot value in the hash bucket, we consider it invalid and release the slot and the HPTE. This gives us the opportunity to determine the validity of S,G,I,X bits based on its contents and not on any of the bits V0,V1,V2 or V3 in the primary PTE When we release a HPTE cached in the 1111 slot we also release a legitimate slot in the primary hash bucket and unmap its corresponding HPTE. This is to ensure that we do get a HPTE cached in a slot of the primary hash bucket, the next time we retry. Though treating 1111 slot as invalid, reduces the number of available slots in the hash bucket and may have an effect on the performance, the probabilty of hitting a 1111 slot is extermely low. Compared to the current scheme, the above scheme reduces the number of false hash table updates significantly and has the added advantage of releasing four valuable PTE bits for other purpose. NOTE:even though bits 3, 4, 5, 6, 7 are not used when the 64K PTE is backed by 4k HPTE, they continue to be used if the PTE gets backed by 64k HPTE. The next patch will decouple that aswell, and truely release the bits. This idea was jointly developed by Paul Mackerras, Aneesh, Michael Ellermen and myself. 4K PTE format remains unchanged currently. The patch does the following code changes a) PTE flags are split between 64k and 4k header files. b) __hash_page_4K() is reimplemented to reflect the above logic. Acked-by: Balbir Singh <bsingharora@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20powerpc: introduce pte_get_hash_gslot() helperRam Pai1-0/+18
Introduce pte_get_hash_gslot()() which returns the global slot number of the HPTE in the global hash table. This function will come in handy as we work towards re-arranging the PTE bits in the later patches. Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06powerpc/mm/hash: Add pr_fmt() to hash_utils64.cAneesh Kumar K.V1-0/+1
Make the printks look a bit nicer by adding a prefix. Radix config now do radix-mmu: Page sizes from device-tree: radix-mmu: Page size shift = 12 AP=0x0 radix-mmu: Page size shift = 16 AP=0x5 radix-mmu: Page size shift = 21 AP=0x1 radix-mmu: Page size shift = 30 AP=0x2 This patch update hash config to do similar dmesg output. With the patch we have hash-mmu: Page sizes from device-tree: hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0 hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7 hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56 hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1 hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8 hash-mmu: base_shift=20: shift=20, sllp=0x0111, avpnm=0x00000000, tlbiel=0, penc=2 hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0 hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-23powerpc/mm: Use mm_is_thread_local() instread of open-codingBenjamin Herrenschmidt1-4/+2
We open-code testing for the mm being local to the current CPU in a few places. Use our existing helper instead. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-17Merge branch 'topic/ppc-kvm' into nextMichael Ellerman1-2/+3
Bring in the commit to rename find_linux_pte_or_hugepte() which touches arch and KVM code, and might need to be merged with the kvmppc tree to avoid conflicts.
2017-08-17powerpc/mm: Rename find_linux_pte_or_hugepte()Aneesh Kumar K.V1-2/+3
Add newer helpers to make the function usage simpler. It is always recommended to use find_current_mm_pte() for walking the page table. If we cannot use find_current_mm_pte(), it should be documented why the said usage of __find_linux_pte() is safe against a parallel THP split. For now we have KVM code using __find_linux_pte(). This is because kvm code ends up calling __find_linux_pte() in real mode with MSR_EE=0 but with PACA soft_enabled = 1. We may want to fix that later and make sure we keep the MSR_EE and PACA soft_enabled in sync. When we do that we can switch kvm to use find_linux_pte(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-16powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel ↵Aneesh Kumar K.V1-1/+1
command line With commit aa888a74977a8 ("hugetlb: support larger than MAX_ORDER") we added support for allocating gigantic hugepages via kernel command line. Switch ppc64 arch specific code to use that. W.r.t FSL support, we now limit our allocation range using BOOTMEM_ALLOC_ACCESSIBLE. We use the kernel command line to do reservation of hugetlb pages on powernv platforms. On pseries hash mmu mode the supported gigantic huge page size is 16GB and that can only be allocated with hypervisor assist. For pseries the command line option doesn't do the allocation. Instead pseries does gigantic hugepage allocation based on hypervisor hint that is specified via "ibm,expected#pages" property of the memory node. Cc: Scott Wood <oss@buserror.net> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-08powerpc/mm/book3s64: Make KERN_IO_START a variableMichael Ellerman1-0/+1
Currently KERN_IO_START is defined as: #define KERN_IO_START (KERN_VIRT_START + (KERN_VIRT_SIZE >> 1)) Although it looks like a constant, both the components are actually variables, to allow us to have a different value between Radix and Hash with a single kernel. However that still requires both Radix and Hash to place the kernel IO region at the same location relative to the start and end of the kernel virtual region (namely 1/2 way through it), and we'd like to change that. So split KERN_IO_START out into its own variable, and initialise it for Radix and Hash. In the medium term we should be able to reconsolidate this, by doing a more involved rearrangement of the location of the regions. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-31powerpc/mm: Fix check of multiple 16G pages from device treeRui Teng1-1/+1
The offset of hugepage block will not be 16G, if the expected page is more than one. Calculate the totol size instead of the hardcode value. Fixes: 4792adbac9eb ("powerpc: Don't use a 16G page if beyond mem= limits") Signed-off-by: Rui Teng <rui.teng@linux.vnet.ibm.com> Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-23powerpc/mm: Trace tlbie(l) instructionsBalbir Singh1-0/+2
Add a trace point for tlbie(l) (Translation Lookaside Buffer Invalidate Entry (Local)) instructions. The tlbie instruction has changed over the years, so not all versions accept the same operands. Use the ISA v3 field operands because they are the most verbose, we may change them in future. Example output: qemu-system-ppc-5371 [016] 1412.369519: tlbie: tlbie with lpid 0, local 1, rb=67bd8900174c11c1, rs=0, ric=0 prs=0 r=0 Signed-off-by: Balbir Singh <bsingharora@gmail.com> [mpe: Add some missing trace_tlbie()s, reword change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-11powerpc: Create asm/debugfs.h and move powerpc_debugfs_root thereMichael Ellerman1-2/+1
powerpc_debugfs_root is the dentry representing the root of the "powerpc" directory tree in debugfs. Currently it sits in asm/debug.h, a long with some other things that have "debug" in the name, but are otherwise unrelated. Pull it out into a separate header, which also includes linux/debugfs.h, and convert all the users to include debugfs.h instead of debug.h. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-03powerpc/mm: Remove stale comment about the DART holeOliver O'Halloran1-5/+0
The code to fix the problem it describes was removed in commit c40785ad305b ("powerpc/dart: Use a cachable DART"), and it uses the stupid comment style. Away it goooooooooooooes! Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01powerpc/pseries: Skip using reserved virtual address rangeAneesh Kumar K.V1-1/+0
Now that we use all the available virtual address range, we need to make sure we don't generate VSID such that it overlaps with the reserved vsid range. Reserved vsid range include the virtual address range used by the adjunct partition and also the VRMA virtual segment. We find the context value that can result in generating such a VSID and reserve it early in boot. We don't look at the adjunct range, because for now we disable the adjunct usage in a Linux LPAR via CAS interface. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Rewrite hash__reserve_context_id(), move the rest into pseries] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm: Move copy_mm_to_paca to paca.cAneesh Kumar K.V1-2/+2
We also update the function arg to struct mm_struct. Move this so that function finds the definition of struct mm_struct. No functional change in this patch. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm: Move hash specific pte bits to be top bits of RPNAneesh Kumar K.V1-0/+13
We don't support the full 57 bits of physical address and hence can overload the top bits of RPN as hash specific pte bits. Add a BUILD_BUG_ON() to enforce the relationship between H_PAGE_F_SECOND and H_PAGE_F_GIX. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Paul Mackerras <paulus@ozlabs.org> [mpe: Move the BUILD_BUG_ON() into hash_utils_64.c and comment it] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-02sched/headers: Prepare to remove the <linux/mm_types.h> dependency from ↵Ingo Molnar1-1/+1
<linux/sched.h> Update code that relied on sched.h including various MM types for them. This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-10powerpc/pseries: Automatically resize HPT for memory hot add/removeDavid Gibson1-0/+29
We've now implemented code in the pseries platform to use the new PAPR interface to allow resizing the hash page table (HPT) at runtime. This patch uses that interface to automatically attempt to resize the HPT when memory is hot added or removed. This tries to always keep the HPT at a reasonable size for our current memory size. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-10powerpc/pseries: Add support for hash table resizingDavid Gibson1-0/+33
This adds support for using two hypercalls to change the size of the main hash page table while running as a PAPR guest. For now these hypercalls are only in experimental qemu versions. The interface is two part: first H_RESIZE_HPT_PREPARE is used to allocate and prepare the new hash table. This may be slow, but can be done asynchronously. Then, H_RESIZE_HPT_COMMIT is used to switch to the new hash table. This requires that no CPUs be concurrently updating the HPT, and so must be run under stop_machine(). This also adds a debugfs file which can be used to manually control HPT resizing or testing purposes. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Paul Mackerras <paulus@samba.org> [mpe: Rename the debugfs file to "hpt_order"] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-17powerpc/mm: Fix memory hotplug BUG() on radixReza Arbab1-2/+2
Memory hotplug is leading to hash page table calls, even on radix: arch_add_memory create_section_mapping htab_bolt_mapping BUG_ON(!ppc_md.hpte_insert); To fix, refactor {create,remove}_section_mapping() into hash__ and radix__ variants. Leave the radix versions stubbed for now. Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-24Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds1-1/+1
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-24/+4
Pull KVM updates from Paolo Bonzini: "Small release, the most interesting stuff is x86 nested virt improvements. x86: - userspace can now hide nested VMX features from guests - nested VMX can now run Hyper-V in a guest - support for AVX512_4VNNIW and AVX512_FMAPS in KVM - infrastructure support for virtual Intel GPUs. PPC: - support for KVM guests on POWER9 - improved support for interrupt polling - optimizations and cleanups. s390: - two small optimizations, more stuff is in flight and will be in 4.11. ARM: - support for the GICv3 ITS on 32bit platforms" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits) arm64: KVM: pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest KVM: arm/arm64: timer: Check for properly initialized timer on init KVM: arm/arm64: vgic-v2: Limit ITARGETSR bits to number of VCPUs KVM: x86: Handle the kthread worker using the new API KVM: nVMX: invvpid handling improvements KVM: nVMX: check host CR3 on vmentry and vmexit KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry KVM: nVMX: propagate errors from prepare_vmcs02 KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation KVM: nVMX: support restore of VMX capability MSRs KVM: nVMX: generate non-true VMX MSRs based on true versions KVM: x86: Do not clear RFLAGS.TF when a singlestep trap occurs. KVM: x86: Add kvm_skip_emulated_instruction and use it. KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12 KVM: VMX: Reorder some skip_emulated_instruction calls KVM: x86: Add a return value to kvm_emulate_cpuid KVM: PPC: Book3S: Move prototypes for KVM functions into kvm_ppc.h ...
2016-11-25powerpc/mm: Fixup kernel read only mappingAneesh Kumar K.V1-2/+6
With commit e58e87adc8bf9 ("powerpc/mm: Update _PAGE_KERNEL_RO") we started using the ppp value 0b110 to map kernel readonly. But that facility was only added as part of ISA 2.04. For earlier ISA version only supported ppp bit value for readonly mapping is 0b011. (This implies both user and kernel get mapped using the same ppp bit value for readonly mapping.). Update the code such that for earlier architecture version we use ppp value 0b011 for readonly mapping. We don't differentiate between power5+ and power5 here and apply the new ppp bits only from power6 (ISA 2.05). This keep the changes minimal. This fixes issue with PS3 spu usage reported at https://lkml.kernel.org/r/rep.1421449714.geoff@infradead.org Fixes: e58e87adc8bf9 ("powerpc/mm: Update _PAGE_KERNEL_RO") Cc: stable@vger.kernel.org # v4.7+ Tested-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-23powerpc/64: Provide functions for accessing POWER9 partition tablePaul Mackerras1-24/+4
POWER9 requires the host to set up a partition table, which is a table in memory indexed by logical partition ID (LPID) which contains the pointers to page tables and process tables for the host and each guest. This factors out the initialization of the partition table into a single function. This code was previously duplicated between hash_utils_64.c and pgtable-radix.c. This provides a function for setting a partition table entry, which is used in early MMU initialization, and will be used by KVM whenever a guest is created. This function includes a tlbie instruction which will flush all TLB entries for the LPID and all caches of the partition table entry for the LPID, across the system. This also moves a call to memblock_set_current_limit(), which was in radix_init_partition_table(), but has nothing to do with the partition table. By analogy with the similar code for hash, the call gets moved to near the end of radix__early_init_mmu(). It now gets called when running as a guest, whereas previously it would only be called if the kernel is running as the host. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-18powerpc/mm: Fix missing update of HID register on secondary CPUsAneesh Kumar K.V1-0/+4
We need to update on secondaries for the selected MMU mode. Fixes: ad410674f560 ("powerpc/mm: Update the HID bit when switching from radix to hash") Reported-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-10-14Merge tag 'powerpc-4.9-2' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull more powerpc updates from Michael Ellerman: "Some more powerpc updates for 4.9: Freescale updates from Scott Wood: - qbman support (a prerequisite for datapath drivers such as ethernet) - a PCI DMA fix+improvement - reset handler changes - more 8xx optimizations - some cleanups and fixes.' Fixes: - selftests/powerpc: Add missing binaries to .gitignores (Michael Ellerman) - selftests/powerpc: Fix build break caused by EXPORT_SYMBOL changes (Michael Ellerman) - powerpc/pseries: Fix stack corruption in htpe code (Laurent Dufour) - powerpc/64s: Fix power4_fixup_nap placement (Nicholas Piggin) - powerpc/64: Fix incorrect return value from __copy_tofrom_user (Paul Mackerras) - powerpc/mm/hash64: Fix might_have_hea() check (Michael Ellerman) Other: - MAINTAINERS: Remove myself from PA Semi entries (Olof Johansson) - MAINTAINERS: Drop separate pseries entry (Michael Ellerman) - MAINTAINERS: Update powerpc website & add selftests (Michael Ellerman): * tag 'powerpc-4.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (35 commits) powerpc/mm/hash64: Fix might_have_hea() check powerpc/64: Fix incorrect return value from __copy_tofrom_user powerpc/64s: Fix power4_fixup_nap placement powerpc/pseries: Fix stack corruption in htpe code selftests/powerpc: Fix build break caused by EXPORT_SYMBOL changes MAINTAINERS: Update powerpc website & add selftests MAINTAINERS: Drop separate pseries entry MAINTAINERS: Remove myself from PA Semi entries selftests/powerpc: Add missing binaries to .gitignores arch/powerpc: Add CONFIG_FSL_DPAA to corenetXX_smp_defconfig soc/qman: Add self-test for QMan driver soc/bman: Add self-test for BMan driver soc/fsl: Introduce DPAA 1.x QMan device driver soc/fsl: Introduce DPAA 1.x BMan device driver powerpc/8xx: make user addr DTLB miss the short path powerpc/8xx: Move additional DTLBMiss handlers out of exception area powerpc/8xx: use r3 to scratch CR in ITLBmiss soc/fsl/qe: fix gpio save_regs functions powerpc/8xx: add dedicated machine check handler powerpc/8xx: add system_reset_exception ...
2016-10-12powerpc/mm/hash64: Fix might_have_hea() checkMichael Ellerman1-1/+1
In commit 2b4e3ad8f579 ("powerpc/mm/hash64: Don't test for machine type to detect HEA special case") we changed the logic in might_have_hea() to check FW_FEATURE_SPLPAR rather than machine_is(pseries). However the check was incorrectly negated, leading to crashes on machines with HEA adapters, such as: mm: Hashing failure ! EA=0xd000080080004040 access=0x800000000000000e current=NetworkManager trap=0x300 vsid=0x13d349c ssize=1 base psize=2 psize 2 pte=0xc0003cc033e701ae Unable to handle kernel paging request for data at address 0xd000080080004040 Call Trace: .ehea_create_cq+0x148/0x340 [ehea] (unreliable) .ehea_up+0x258/0x1200 [ehea] .ehea_open+0x44/0x1a0 [ehea] ... Fix it by removing the negation. Fixes: 2b4e3ad8f579 ("powerpc/mm/hash64: Don't test for machine type to detect HEA special case") Cc: stable@vger.kernel.org # v4.8+ Reported-by: Denis Kirjanov <kda@linux-powerpc.org> Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-10-07Merge tag 'powerpc-4.9-1' of ↵Linus Torvalds1-31/+50
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights: - Major rework of Book3S 64-bit exception vectors (Nicholas Piggin) - Use gas sections for arranging exception vectors et. al. - Large set of TM cleanups and selftests (Cyril Bur) - Enable transactional memory (TM) lazily for userspace (Cyril Bur) - Support for XZ compression in the zImage wrapper (Oliver O'Halloran) - Add support for bpf constant blinding (Naveen N. Rao) - Beginnings of upstream support for PA Semi Nemo motherboards (Darren Stevens) Fixes: - Ensure .mem(init|exit).text are within _stext/_etext (Michael Ellerman) - xmon: Don't use ld on 32-bit (Michael Ellerman) - vdso64: Use double word compare on pointers (Anton Blanchard) - powerpc/nvram: Fix an incorrect partition merge (Pan Xinhui) - powerpc: Fix usage of _PAGE_RO in hugepage (Christophe Leroy) - powerpc/mm: Update FORCE_MAX_ZONEORDER range to allow hugetlb w/4K (Aneesh Kumar K.V) - Fix memory leak in queue_hotplug_event() error path (Andrew Donnellan) - Replay hypervisor maintenance interrupt first (Nicholas Piggin) Various performance optimisations (Anton Blanchard): - Align hot loops of memset() and backwards_memcpy() - During context switch, check before setting mm_cpumask - Remove static branch prediction in atomic{, 64}_add_unless - Only disable HAVE_EFFICIENT_UNALIGNED_ACCESS on POWER7 little endian - Set default CPU type to POWER8 for little endian builds Cleanups & features: - Sparse fixes/cleanups (Daniel Axtens) - Preserve CFAR value on SLB miss caused by access to bogus address (Paul Mackerras) - Radix MMU fixups for POWER9 (Aneesh Kumar K.V) - Support for setting used_(vsr|vr|spe) in sigreturn path (for CRIU) (Simon Guo) - Optimise syscall entry for virtual, relocatable case (Nicholas Piggin) - Optimise MSR handling in exception handling (Nicholas Piggin) - Support for kexec with Radix MMU (Benjamin Herrenschmidt) - powernv EEH fixes (Russell Currey) - Suprise PCI hotplug support for powernv (Gavin Shan) - Endian/sparse fixes for powernv PCI (Gavin Shan) - Defconfig updates (Anton Blanchard) - KVM: PPC: Book3S HV: Migrate pinned pages out of CMA (Balbir Singh) - cxl: Flush PSL cache before resetting the adapter (Frederic Barrat) - cxl: replace loop with for_each_child_of_node(), remove unneeded of_node_put() (Andrew Donnellan) - Fix HV facility unavailable to use correct handler (Nicholas Piggin) - Remove unnecessary syscall trampoline (Nicholas Piggin) - fadump: Fix build break when CONFIG_PROC_VMCORE=n (Michael Ellerman) - Quieten EEH message when no adapters are found (Anton Blanchard) - powernv: Add PHB register dump debugfs handle (Russell Currey) - Use kprobe blacklist for exception handlers & asm functions (Nicholas Piggin) - Document the syscall ABI (Nicholas Piggin) - MAINTAINERS: Update cxl maintainers (Michael Neuling) - powerpc: Remove all usages of NO_IRQ (Michael Ellerman) Minor cleanups: - Andrew Donnellan, Christophe Leroy, Colin Ian King, Cyril Bur, Frederic Barrat, Pan Xinhui, PrasannaKumar Muralidharan, Rui Teng, Simon Guo" * tag 'powerpc-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (156 commits) powerpc/bpf: Add support for bpf constant blinding powerpc/bpf: Implement support for tail calls powerpc/bpf: Introduce accessors for using the tmp local stack space powerpc/fadump: Fix build break when CONFIG_PROC_VMCORE=n powerpc: tm: Enable transactional memory (TM) lazily for userspace powerpc/tm: Add TM Unavailable Exception powerpc: Remove do_load_up_transact_{fpu,altivec} powerpc: tm: Rename transct_(*) to ck(\1)_state powerpc: tm: Always use fp_state and vr_state to store live registers selftests/powerpc: Add checks for transactional VSXs in signal contexts selftests/powerpc: Add checks for transactional VMXs in signal contexts selftests/powerpc: Add checks for transactional FPUs in signal contexts selftests/powerpc: Add checks for transactional GPRs in signal contexts selftests/powerpc: Check that signals always get delivered selftests/powerpc: Add TM tcheck helpers in C selftests/powerpc: Allow tests to extend their kill timeout selftests/powerpc: Introduce GPR asm helper header file selftests/powerpc: Move VMX stack frame macros to header file selftests/powerpc: Rework FPU stack placement macros and move to header file selftests/powerpc: Check for VSX preservation across userspace preemption ...
2016-09-23powerpc: Clean up tm_abort duplication in hash_utils_64.cRui Teng1-31/+25
The same logic appears twice and should probably be pulled out into a function. Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Rui Teng <rui.teng@linux.vnet.ibm.com> [mpe: Rename to tm_flush_hash_page() and move comment into the function] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-13powerpc/mm: Update the HID bit when switching from radix to hashAneesh Kumar K.V1-0/+25
Power9 DD1 requires to update the hid0 register when switching from hash to radix. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-09powerpc/mm: Speed up computation of base and actual page size for a HPTEPaul Mackerras1-0/+55
This replaces a 2-D search through an array with a simple 8-bit table lookup for determining the actual and/or base page size for a HPT entry. The encoding in the second doubleword of the HPTE is designed to encode the actual and base page sizes without using any more bits than would be needed for a 4k page number, by using between 1 and 8 low-order bits of the RPN (real page number) field to encode the page sizes. A single "large page" bit in the first doubleword indicates that these low-order bits are to be interpreted like this. We can determine the page sizes by using the low-order 8 bits of the RPN to look up a 256-entry table. For actual page sizes less than 1MB, some of the upper bits of these 8 bits are going to be real address bits, but we can cope with that by replicating the entries for those smaller page sizes. While we're at it, let's move the hpte_page_size() and hpte_base_page_size() functions from a KVM-specific header to a header for 64-bit HPT systems, since this computation doesn't have anything specifically to do with KVM. Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-08-01powerpc/mm: Convert early cpu/mmu feature check to use the new helpersAneesh Kumar K.V1-1/+1
This switches early feature checks to use the non static key variant of the function. In later patches we will be switching cpu_has_feature() and mmu_has_feature() to use static keys and we can use them only after static key/jump label is initialized. Any check for feature before jump label init should be done using this new helper. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01powerpc/mm: Do hash device tree scanning earlierMichael Ellerman1-26/+27
Currently MMU initialisation (early_init_mmu()) consists of a mixture of scanning the device tree, setting MMU feature bits, and then also doing actual initialisation of MMU data structures. We'd like to decouple the setting of the MMU features from the actual setup. So split out the device tree scanning, and associated code, and call it from mmu_init_early_devtree(). Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-28powerpc/mm: Parenthesise IS_ENABLED() in if conditionStephen Rothwell1-1/+1
Currently IS_ENABLED() produces an expression surrounded by parentheses, which allows this code to compile, generating eg: else if (1 || 0) hpte_init_native(); However a change to the macro in the kbuild tree will break this in future by removing the parentheses. Fixes: 7353644fa9df ("powerpc/mm: Fix build break when PPC_NATIVE=n") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-26powerpc/mm: Rename hpte_init_lpar() and move the fallback to a headerMichael Ellerman1-6/+1
hpte_init_lpar() is part of the pseries platform, so name it as such. Move the fallback implementation for when PSERIES=n into the header, dropping the weak implementation. The panic() is now handled by the calling code. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-26powerpc/mm: Fix build break when PPC_NATIVE=nMichael Ellerman1-1/+4
The recent commit to rework the hash MMU setup broke the build when CONFIG_PPC_NATIVE=n. Fix it by adding an IS_ENABLED() check before calling hpte_init_native(). Removing the else clause opens the possibility that we don't set any ops, which would probably lead to a strange crash later. So add a check that we correctly initialised at least one member of the struct. Fixes: 166dd7d3fbf2 ("powerpc/64: Move MMU backend selection out of platform code") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-21powerpc/mm: Move hash table ops to a separate structureBenjamin Herrenschmidt1-23/+28
Moving probe_machine() to after mmu init will cause the ppc_md fields relative to the hash table management to be overwritten. Since we have essentially disconnected the machine type from the hash backend ops, finish the job by moving them to a different structure. The only callback that didn't quite fix is update_partition_table since this is not specific to hash, so I moved it to a standalone variable for now. We can revisit later if needed. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Fix ppc64e build failure in kexec] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-21powerpc/mm/hash64: Don't test for machine type to detect HEA special caseBenjamin Herrenschmidt1-2/+3
Instead, check for FW_FEATURE_SPLPAR. This should be roughtly equivalent as all pseries machiens that can have an HEA also support SPLPAR and no other machine type does. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-21powerpc/mm/hash: Don't use machine_is() early during bootBenjamin Herrenschmidt1-9/+17
Use the device-tree instead as we'll be moving probe_machine() out of early_setup Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-21powerpc/64: Move MMU backend selection out of platform codeBenjamin Herrenschmidt1-0/+14
We move it into early_mmu_init() based on firmware features. For PS3, we have to move the setting of these into early_init_devtree(). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-21powerpc/dart: Use a cachable DARTBenjamin Herrenschmidt1-32/+0
Instead of punching a hole in the linear mapping, just use normal cachable memory, and apply the flush sequence documented in the CPC625 (aka U3) user manual. This allows us to remove quite a bit of code related to the early allocation of the DART and the hole in the linear mapping. We can also get rid of the copy of the DART for suspend/resume as the original memory can just be saved/restored now, as long as we properly sync the caches. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Integrate dart_init() fix to return ENODEV when DART disabled] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17powerpc/mm/hash: Update SDR1 size encoding as documented in ISA 3.0Aneesh Kumar K.V1-5/+4
ISA 3.0 document hash table size in bytes = 2^(HTABSIZE + 18) No functionality change by this patch. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17powerpc/mm: Print formation regarding the the MMU modeAneesh Kumar K.V1-1/+2
This helps in easily identifying the MMU mode with which the kernel is operating. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-15Merge tag 'powerpc-4.7-5' into nextMichael Ellerman1-5/+13
Pull in the fixes we sent during 4.7, we have code we want to merge into next that depends on some of them.
2016-07-05powerpc/mm: Add a parameter to disable 1TB segsOliver O'Halloran1-0/+15
This patch adds the kernel command line parameter "no_tb_segs" which forces the kernel to use 256MB rather than 1TB segments. Forcing the use of 256MB segments makes it considerably easier to test code that depends on an SLB miss occurring. Suggested-by: Michael Neuling <mikey@neuling.org> Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-30powerpc: Initialise pci_io_base as early as possibleDarren Stevens1-0/+4
Commit d6a9996e84ac ("powerpc/mm: vmalloc abstraction in preparation for radix") turned kernel memory and IO addresses from #defined constants to variables initialised at runtime. On PA6T (pasemi) systems the setup_arch() machine call initialises the onboard PCI-e root-ports, and uses pci_io_base to do this, which is now before its value has been set, resulting in a panic early in boot before console IO is initialised. Move the pci_io_base initialisation to the same place as vmalloc ranges are set (hash__early_init_mmu()/radix__early_init_mmu()) - this is the earliest possible place we can initialise it. Fixes: d6a9996e84ac ("powerpc/mm: vmalloc abstraction in preparation for radix") Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de> Signed-off-by: Darren Stevens <darren@stevens-zone.net> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Add #ifdef CONFIG_PCI, massage change log slightly] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-17powerpc/mm/hash: Don't add memory coherence if cache inhibited is setAneesh Kumar K.V1-5/+9
H_ENTER hcall handling in qemu had assumptions that a cache inhibited hpte entry won't have memory conference set. Also older kernel mentioned that some version of pHyp required this (the code removed by the below commit says: /* Make pHyp happy */ if ((rflags & _PAGE_NO_CACHE) && !(rflags & _PAGE_WRITETHRU)) hpte_r &= ~HPTE_R_M; But with older kernel we had some inconsistent memory conherence mapping. We always enabled memory conherence in the page fault path and removed memory conherence is _PAGE_NO_CACHE was set when we mapped the page via htab_bolt_mapping. The commit mentioned below tried to consolidate that by always enabling memory conherence. But as mentioned above that breaks Qemu H_ENTER handling. This patch update this such that we enable memory conherence only if cache inhibited is not set and bring fault handling, lpar and bolt mapping in sync. Fixes: commit 30bda41aba4e("powerpc/mm: Drop WIMG in favour of new constant") Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-01powerpc/mm/hash: Fix the reference bit update when handling hash faultAneesh Kumar K.V1-2/+20
When we converted the asm routines to C functions, we missed updating HPTE_R_R based on _PAGE_ACCESSED. ASM code used to copy over the lower bits from pte via. andi. r3,r30,0x1fe /* Get basic set of flags */ We also update the code such that we won't update the Change bit ('C' bit) always. This was added by commit c5cf0e30bf3d8 ("powerpc: Fix buglet with MMU hash management"). With hash64, we need to make sure that hardware doesn't do a pte update directly. This is because we do end up with entries in TLB with no hash page table entry. This happens because when we find a hash bucket full, we "evict" a more/less random entry from it. When we do that we don't invalidate the TLB (hpte_remove) because we assume the old translation is still technically "valid". For more info look at commit 0608d692463("powerpc/mm: Always invalidate tlb on hpte invalidate and update"). Thus it's critical that valid hash PTEs always have reference bit set and writeable ones have change bit set. We do this by hashing a non-dirty linux PTE as read-only and always setting _PAGE_ACCESSED (and thus R) when hashing anything else in. Any attempt by Linux at clearing those bits also removes the corresponding hash entry. Commit 5cf0e30bf3d8 did that for 'C' bit by enabling 'C' bit always. We don't really need to do that because we never map a RW pte entry without setting 'C' bit. On READ fault on a RW pte entry, we still map it READ only, hence a store update in the page will still cause a hash pte fault. This patch reverts the part of commit c5cf0e30bf3d8 ("[PATCH] powerpc: Fix buglet with MMU hash management") and retain the updatepp part. - If we hit the updatepp path on native, the old code without that commit, would fail to set C bcause native_hpte_updatepp() was implemented to filter the same bits as H_PROTECT and not let C through thus we would "upgrade" a RO HPTE to RW without setting C thus causing the bug. So the real fix in that commit was the change to native_hpte_updatepp Fixes: 89ff725051d1 ("powerpc/mm: Convert __hash_page_64K to C") Cc: stable@vger.kernel.org # v4.5+ Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11powerpc/mm/hash64: Fix subpage protection with 4K HPTE configMichael Ellerman1-1/+9
With Linux page size of 64K and hardware only supporting 4K HPTE, if we use subpage protection, we always fail for the subpage 0 as shown below (using the selftest subpage_prot test): 520175565: (4520111850): Failed at 0x3fffad4b0000 (p=13,sp=0,w=0), want=fault, got=pass ! 4520890210: (4520826495): Failed at 0x3fffad5b0000 (p=29,sp=0,w=0), want=fault, got=pass ! 4521574251: (4521510536): Failed at 0x3fffad6b0000 (p=45,sp=0,w=0), want=fault, got=pass ! 4522258324: (4522194609): Failed at 0x3fffad7b0000 (p=61,sp=0,w=0), want=fault, got=pass ! This is because hash preload wrongly inserts the HPTE entry for subpage 0 without looking at the subpage protection information. Fix it by teaching should_hash_preload() not to preload if we have subpage protection configured for that range. It appears this has been broken since it was introduced in 2008. Fixes: fa28237cfcc5 ("[POWERPC] Provide a way to protect 4k subpages when using 64k pages") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Rework into should_hash_preload() to avoid build fails w/SLICES=n] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11powerpc/mm/hash64: Factor out hash preload psize checkMichael Ellerman1-4/+17
Currently we have a check in hash_preload() against the psize, which is only included when CONFIG_PPC_MM_SLICES is enabled. We want to expand this check in a subsequent patch, so factor it out to allow that. As a bonus it removes the #ifdef in the C code. Unfortunately we can't put this in the existing CONFIG_PPC_MM_SLICES block because it would require a forward declaration. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11powerpc/mm: pte_frag abstractionAneesh Kumar K.V1-0/+3
In this patch we make the number of pte fragments per level 4 page table page a variable. Radix level 4 table size is 256 bytes and hence we can have 256 fragments per level 4 page. We don't update the fragment count in this patch. We need to do performance measurements to find the right value for fragment count. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11powerpc/mm: vmalloc abstraction in preparation for radixAneesh Kumar K.V1-0/+8
The vmalloc range differs between hash and radix config. Hence make VMALLOC_START and related constants a variable which will be runtime initialized depending on whether hash or radix mode is active. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Fix missing init of ioremap_bot in pgtable_64.c for ppc64e] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11powerpc/mm: Add radix pgalloc detailsAneesh Kumar K.V1-0/+7
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11powerpc/mm/radix: Update PTCR on secondary CPUsAneesh Kumar K.V1-3/+7
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01powerpc/mm: Abstract early MMU init in preparation for radixAneesh Kumar K.V1-3/+3
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01powerpc/mm: Make page table size a variableAneesh Kumar K.V1-0/+12
Radix and hash MMU models support different page table sizes. Make the #defines a variable so that existing code can work with variable sizes. Slice related code is only used by hash, so use hash constants there. We will replicate some of the boundary conditions with resepct to TASK_SIZE using radix values too. Right now we do boundary condition check using hash constants. Swapper pgdir size is initialized in asm code. We select the max pgd size to keep it simple. For now we select hash pgdir. When adding radix we will switch that to radix pgdir which is 64K. BUILD_BUG_ON check which is removed is already done in hugepage_init() using MAYBE_BUILD_BUG_ON(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01powerpc/mm/book3s: Rename hash specific PTE bits to carry H_ prefixAneesh Kumar K.V1-4/+4
This helps to make following hash only pte bits easier. We have kept _PAGE_CHG_MASK, _HPAGE_CHG_MASK and _PAGE_PROT_BITS as it is in this patch eventhough they use hash specific bits. Using them in radix as it is should be ok, because with radix we expect those bit positions to be zero. Only renames in this patch, no change in functionality. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01powerpc/mm/hash: Add support for Power9 HashAneesh Kumar K.V1-2/+40
PowerISA 3.0 adds a parition table indexed by LPID. Parition table allows us to specify the MMU model that will be used for guest and host translation. This patch adds support with SLB based hash model (UPRT = 0). What is required with this model is to support the new hash page table entry format and also setup partition table such that we use hash table for address translation. We don't have segment table support yet. In order to make sure we don't load KVM module on Power9 (since we don't have kvm support yet) this patch also disables KVM on Power9. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01powerpc/mm: Drop WIMG in favour of new constantsAneesh Kumar K.V1-9/+9
PowerISA 3.0 introduces two pte bits with the below meaning for radix: 00 -> Normal Memory 01 -> Strong Access Order (SAO) 10 -> Non idempotent I/O (Cache inhibited and guarded) 11 -> Tolerant I/O (Cache inhibited) We drop the existing WIMG bits in the Linux page table in favour of the above constants. We loose _PAGE_WRITETHRU with this conversion. We only use writethru via pgprot_cached_wthru() which is used by fbdev/controlfb.c which is Apple control display and also PPC32. With respect to _PAGE_COHERENCE, we have been marking hpte always coherent for some time now. htab_convert_pte_flags() always added HPTE_R_M. NOTE: KVM changes need closer review. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01powerpc/mm: Update _PAGE_KERNEL_ROAneesh Kumar K.V1-6/+11
PS3 had used a PPP bit hack to implement a read only mapping in the kernel area. Since we are bolting the ioremap area, it used the pte flags _PAGE_PRESENT | _PAGE_USER to get a PPP value of 0x3 there by resulting in a read only mapping. This means the area can be accessed by user space, but kernel will never return such an address to user space. But we can do better by implementing a read only kernel mapping using PPP bits 0b110. This also allows us to do read only kernel mapping for radix in later patches. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01powerpc/mm: Replace _PAGE_USER with _PAGE_PRIVILEGEDAneesh Kumar K.V1-6/+10
_PAGE_PRIVILEGED means the page can be accessed only by the kernel. This is done to keep pte bits similar to PowerISA 3.0 Radix PTE format. User pages are now marked by clearing _PAGE_PRIVILEGED bit. Previously we allowed the kernel to have a privileged page in the lower address range (USER_REGION). With this patch such access is denied. We also prevent a kernel access to a non-privileged page in higher address range (ie, REGION_ID != 0). Both the above access scenarios should never happen. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jeremy Kerr <jk@ozlabs.org> Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01powerpc/mm/subpage: Clear RWX bit to indicate no accessAneesh Kumar K.V1-3/+8
Subpage protection used to depend on the _PAGE_USER bit to implement no access mode. This patch switches that to use _PAGE_RWX. We clear Read, Write and Execute access from the pte instead of clearing _PAGE_USER now. This was done so that we can switch to _PAGE_PRIVILEGED in a later patch. subpage_protection() returns pte bits that need to be cleared. Instead of updating the interface to handle no-access in a separate way, it appears simpler to clear RWX acecss to indicate no access. We still don't insert hash ptes for no access implied by !_PAGE_RWX. Hence we should not get PROT_FAULT with change. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01powerpc/mm: Use _PAGE_READ to indicate Read accessAneesh Kumar K.V1-4/+5
This splits the _PAGE_RW bit into _PAGE_READ and _PAGE_WRITE. It also removes the dependency on _PAGE_USER for implying read only. Few things to note here is that, we have read implied with write and execute permission. Hence we should always find _PAGE_READ set on hash pte fault. We still can't switch PROT_NONE to !(_PAGE_RWX). Auto numa depends on marking a prot none pte _PAGE_WRITE. (For more details look at b191f9b106ea "mm: numa: preserve PTE write permissions across a NUMA hinting fault") Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jeremy Kerr <jk@ozlabs.org> Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-19Merge tag 'powerpc-4.6-1' of ↵Linus Torvalds1-33/+54
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "This was delayed a day or two by some build-breakage on old toolchains which we've now fixed. There's two PCI commits both acked by Bjorn. There's one commit to mm/hugepage.c which is (co)authored by Kirill. Highlights: - Restructure Linux PTE on Book3S/64 to Radix format from Paul Mackerras - Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh Kumar K.V - Add POWER9 cputable entry from Michael Neuling - FPU/Altivec/VSX save/restore optimisations from Cyril Bur - Add support for new ftrace ABI on ppc64le from Torsten Duwe Various cleanups & minor fixes from: - Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy, Cyril Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell Currey, Sukadev Bhattiprolu, Suraj Jitindar Singh. General: - atomics: Allow architectures to define their own __atomic_op_* helpers from Boqun Feng - Implement atomic{, 64}_*_return_* variants and acquire/release/ relaxed variants for (cmp)xchg from Boqun Feng - Add powernv_defconfig from Jeremy Kerr - Fix BUG_ON() reporting in real mode from Balbir Singh - Add xmon command to dump OPAL msglog from Andrew Donnellan - Add xmon command to dump process/task similar to ps(1) from Douglas Miller - Clean up memory hotplug failure paths from David Gibson pci/eeh: - Redesign SR-IOV on PowerNV to give absolute isolation between VFs from Wei Yang. - EEH Support for SRIOV VFs from Wei Yang and Gavin Shan. - PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang - PCI: Add pcibios_bus_add_device() weak function from Wei Yang - MAINTAINERS: Update EEH details and maintainership from Russell Currey cxl: - Support added to the CXL driver for running on both bare-metal and hypervisor systems, from Christophe Lombard and Frederic Barrat. - Ignore probes for virtual afu pci devices from Vaibhav Jain perf: - Export Power8 generic and cache events to sysfs from Sukadev Bhattiprolu - hv-24x7: Fix usage with chip events, display change in counter values, display domain indices in sysfs, eliminate domain suffix in event names, from Sukadev Bhattiprolu Freescale: - Updates from Scott: "Highlights include 8xx optimizations, 32-bit checksum optimizations, 86xx consolidation, e5500/e6500 cpu hotplug, more fman and other dt bits, and minor fixes/cleanup" * tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits) powerpc: Fix unrecoverable SLB miss during restore_math() powerpc/8xx: Fix do_mtspr_cpu6() build on older compilers powerpc/rcpm: Fix build break when SMP=n powerpc/book3e-64: Use hardcoded mttmr opcode powerpc/fsl/dts: Add "jedec,spi-nor" flash compatible powerpc/T104xRDB: add tdm riser card node to device tree powerpc32: PAGE_EXEC required for inittext powerpc/mpc85xx: Add pcsphy nodes to FManV3 device tree powerpc/mpc85xx: Add MDIO bus muxing support to the board device tree(s) powerpc/86xx: Introduce and use common dtsi powerpc/86xx: Update device tree powerpc/86xx: Move dts files to fsl directory powerpc/86xx: Switch to kconfig fragments approach powerpc/86xx: Update defconfigs powerpc/86xx: Consolidate common platform code powerpc32: Remove one insn in mulhdu powerpc32: small optimisation in flush_icache_range() powerpc: Simplify test in __dma_sync() powerpc32: move xxxxx_dcache_range() functions inline powerpc32: Remove clear_pages() and define clear_page() inline ...
2016-03-17powerpc: query dynamic DEBUG_PAGEALLOC settingJoonsoo Kim1-16/+20
We can disable debug_pagealloc processing even if the code is compiled with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query whether it is enabled or not in runtime. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-02powerpc/mm: Split hash page table sizing heuristic into a helperDavid Gibson1-13/+21
htab_get_table_size() either retrieve the size of the hash page table (HPT) from the device tree - if the HPT size is determined by firmware - or uses a heuristic to determine a good size based on RAM size if the kernel is responsible for allocating the HPT. To support a PAPR extension allowing resizing of the HPT, we're going to want the memory size -> HPT size logic elsewhere, so split it out into a helper function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-01powerpc/mm: Clean up memory hotplug failure pathsDavid Gibson1-5/+10
This makes a number of cleanups to handling of mapping failures during memory hotplug on Power: For errors creating the linear mapping for the hot-added region: * This is now reported with EFAULT which is more appropriate than the previous EINVAL (the failure is unlikely to be related to the function's parameters) * An error in this path now prints a warning message, rather than just silently failing to add the extra memory. * Previously a failure here could result in the region being partially mapped. We now clean up any partial mapping before failing. For errors creating the vmemmap for the hot-added region: * This is now reported with EFAULT instead of causing a BUG() - this could happen for external reason (e.g. full hash table) so it's better to handle this non-fatally * An error message is also printed, so the failure won't be silent * As above a failure could cause a partially mapped region, we now clean this up. [mpe: move htab_remove_mapping() out of #ifdef CONFIG_MEMORY_HOTPLUG to enable this] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-01powerpc/mm: Handle removing maybe-present bolted HPTEsDavid Gibson1-3/+12
At the moment the hpte_removebolted callback in ppc_md returns void and will BUG_ON() if the hpte it's asked to remove doesn't exist in the first place. This is awkward for the case of cleaning up a mapping which was partially made before failing. So, we add a return value to hpte_removebolted, and have it return ENOENT in the case that the HPTE to remove didn't exist in the first place. In the (sole) caller, we propagate errors in hpte_removebolted to its caller to handle. However, we handle ENOENT specially, continuing to complete the unmapping over the specified range before returning the error to the caller. This means that htab_remove_mapping() will work sanely on a partially present mapping, removing any HPTEs which are present, while also returning ENOENT to its caller in case it's important there. There are two callers of htab_remove_mapping(): - In remove_section_mapping() we already WARN_ON() any error return, which is reasonable - in this case the mapping should be fully present - In vmemmap_remove_mapping() we BUG_ON() any error. We change that to just a WARN_ON() in the case of ENOENT, since failing to remove a mapping that wasn't there in the first place probably shouldn't be fatal. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-01powerpc/mm: Clean up error handling for htab_remove_mappingDavid Gibson1-7/+6
Currently, the only error that htab_remove_mapping() can report is -EINVAL, if removal of bolted HPTEs isn't implemeted for this platform. We make a few clean ups to the handling of this: * EINVAL isn't really the right code - there's nothing wrong with the function's arguments - use ENODEV instead * We were also printing a warning message, but that's a decision better left up to the callers, so remove it * One caller is vmemmap_remove_mapping(), which will just BUG_ON() on error, making the warning message redundant, so no change is needed there. * The other caller is remove_section_mapping(). This is called in the memory hot remove path at a point after vmemmap_remove_mapping() so if hpte_removebolted isn't implemented, we'd expect to have already BUG()ed anyway. Put a WARN_ON() here, in lieu of a printk() since this really shouldn't be happening. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-02-27powerpc/mm/book3s-64: Clean up some obsolete or misleading commentsPaul Mackerras1-5/+5
No code changes. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-01-09powerpc: Fix build break due to paca mm_context_t changesMichael Ellerman1-1/+1
Commit 2fc251a8dda5 ("powerpc: Copy only required pieces of the mm_context_t to the paca") broke the build for CONFIG_PPC_STD_MMU_64=y and CONFIG_PPC_MM_SLICES=n. That only happens for a kernel built with 4K pages and HUGETLB disabled, which is why we missed it. Fix it by adding a mm_ctx_user_psize member to the paca and populating it in the appropriate places. Fixes: 2fc251a8dda5 ("powerpc: Copy only required pieces of the mm_context_t to the paca") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-27powerpc: Copy only required pieces of the mm_context_t to the pacaMichael Neuling1-2/+2
Currently we copy the whole mm_context_t to the paca but only access a few bits of it. This is wasteful of space paca and also takes quite some time in the hot path of context switching. This patch pulls in only the required bits from the mm_context_t to the paca and on context switch, copies only those. Benchmarking this (On top of Anton's recent MSR context switching changes [1]) using processes and yield shows an improvement of almost 3% on POWER8: http://ozlabs.org/~anton/junkcode/context_switch2.c ./context_switch2 --test=yield --process 0 0 1. https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-October/135700.html Signed-off-by: Michael Neuling <mikey@neuling.org> [mpe: Rename paca fields to be mm_ctx_foo rather than context_foo] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-19powerpc: Add function to copy mm_context_t to the pacaMichael Neuling1-2/+3
This adds a function to copy the mm->context to the paca. This is only a basic conversion for now but will be used more extensively in the next patch. This also adds #ifdef CONFIG_PPC_BOOK3S around this code since it's not used elsewhere. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14powerpc/mm: Move WIMG update to helper.Aneesh Kumar K.V1-1/+12
Only difference here is, we apply the WIMG mapping early, so rflags passed to updatepp will also be changed. Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14powerpc/mm: Add helper for converting pte bit to hpte bitsAneesh Kumar K.V1-8/+14
Instead of open coding it in multiple code paths, export the helper and add more documentation. Also make sure we don't make assumption regarding pte bit position Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14powerpc/mm: Convert 4k insert from asm to CAneesh Kumar K.V1-26/+0
This is similar to 64K insert. May be we want to consolidate Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14powerpc/mm: Convert __hash_page_64K to CAneesh Kumar K.V1-18/+1
Convert from asm to C Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14powerpc/mm: Remove the dependency on pte bit position in asm codeAneesh Kumar K.V1-0/+29
We should not expect pte bit position in asm code. Simply by moving part of that to C Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14powerpc/mm: Convert 4k hash insert to CAneesh Kumar K.V1-1/+3
Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-14powerpc/mm: Fix infinite loop in hash fault with 4K page sizeAneesh Kumar K.V1-1/+2
This is the same bug we fixed as part of 09567e7fd44291bfc08accfdd67ad8f467842332 ("powerpc/mm: Check paca psize is up to date for huge mappings"). Please check that for details. The difference here is that faults were happening on a 4K page at an address previously mapped by hugetlb. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-10-12powerpc/mm: Differentiate between hugetlb and THP during page walkAneesh Kumar K.V1-3/+4
We need to properly identify whether a hugepage is an explicit or a transparent hugepage in follow_huge_addr(). We used to depend on hugepage shift argument to do that. But in some case that can result in wrong results. For ex: On finding a transparent hugepage we set hugepage shift to PMD_SHIFT. But we can end up clearing the thp pte, via pmdp_huge_get_and_clear. We do prevent reusing the pfn page via the usage of kick_all_cpus_sync(). But that happens after we updated the pte to 0. Hence in follow_huge_addr() we can find hugepage shift set, but transparent huge page check fail for a thp pte. NOTE: We fixed a variant of this race against thp split in commit 691e95fd7396905a38d98919e9c150dbc3ea21a3 ("powerpc/mm/thp: Make page table walk safe against thp split/collapse") Without this patch, we may hit the BUG_ON(flags & FOLL_GET) in follow_page_mask occasionally. In the long term, we may want to switch ppc64 64k page size config to enable CONFIG_ARCH_WANT_GENERAL_HUGETLB Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-18powerpc/mm: Drop CONFIG_PPC_HAS_HASH_64KMichael Ellerman1-6/+6
The relation between CONFIG_PPC_HAS_HASH_64K and CONFIG_PPC_64K_PAGES is painfully complicated. But if we rearrange it enough we can see that PPC_HAS_HASH_64K essentially depends on PPC_STD_MMU_64 && PPC_64K_PAGES. We can then notice that PPC_HAS_HASH_64K is used in files that are only built for PPC_STD_MMU_64, meaning it's equivalent to PPC_64K_PAGES. So replace all uses and drop it. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2015-06-10powerpc/mm: Add trace point for tracking hash pte faultAneesh Kumar K.V1-0/+2
This enables us to understand how many hash fault we are taking when running benchmarks. For ex: -bash-4.2# ./perf stat -e powerpc:hash_fault -e page-faults /tmp/ebizzy.ppc64 -S 30 -P -n 1000 ... Performance counter stats for '/tmp/ebizzy.ppc64 -S 30 -P -n 1000': 1,10,04,075 powerpc:hash_fault 1,10,03,429 page-faults 30.865978991 seconds time elapsed NOTE: The impact of the tracepoint was not noticeable when running test. It was within the run-time variance of the test. For ex: without-patch: -------------- Performance counter stats for './a.out 3000 300': 643 page-faults # 0.089 M/sec 7.236562 task-clock (msec) # 0.928 CPUs utilized 2,179,213 stalled-cycles-frontend # 0.00% frontend cycles idle 17,174,367 stalled-cycles-backend # 0.00% backend cycles idle 0 context-switches # 0.000 K/sec 0.007794658 seconds time elapsed And with-patch: --------------- Performance counter stats for './a.out 3000 300': 643 page-faults # 0.089 M/sec 7.233746 task-clock (msec) # 0.921 CPUs utilized 0 context-switches # 0.000 K/sec 0.007854876 seconds time elapsed Performance counter stats for './a.out 3000 300': 643 page-faults # 0.087 M/sec 649 powerpc:hash_fault # 0.087 M/sec 7.430376 task-clock (msec) # 0.938 CPUs utilized 2,347,174 stalled-cycles-frontend # 0.00% frontend cycles idle 17,524,282 stalled-cycles-backend # 0.00% backend cycles idle 0 context-switches # 0.000 K/sec 0.007920284 seconds time elapsed Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-02powerpc/mm: Fix build break with STRICT_MM_TYPECHECKS && DEBUG_PAGEALLOCMichael Ellerman1-1/+1
If both STRICT_MM_TYPECHECKS and DEBUG_PAGEALLOC are enabled, the code in kernel_map_linear_page() is built, and so we fail with: arch/powerpc/mm/hash_utils_64.c:1478:2: error: incompatible type for argument 1 of 'htab_convert_pte_flags' Fix it by using pgprot_val(). Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-04-23powerpc/mm: Fix build error with CONFIG_PPC_TRANSACTIONAL_MEM disabledAneesh Kumar K.V1-0/+1
This fix the below build error arch/powerpc/mm/hash_utils_64.c: In function ‘flush_hash_hugepage’: arch/powerpc/mm/hash_utils_64.c:1381:1: error: label at end of compound statement tm_abort: ^ make[1]: *** [arch/powerpc/mm/hash_utils_64.o] Error 1 Reported-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-04-17powerpc/mm/thp: Make page table walk safe against thp split/collapseAneesh Kumar K.V1-1/+1
We can disable a THP split or a hugepage collapse by disabling irq. We do send IPI to all the cpus in the early part of split/collapse, and disabling local irq ensure we don't make progress with split/collapse. If the THP is getting split we return NULL from find_linux_pte_or_hugepte(). For all the current callers it should be ok. We need to be careful if we want to use returned pte_t pointer outside the irq disabled region. W.r.t to THP split, the pfn remains the same, but then a hugepage collapse will result in a pfn change. There are few steps we can take to avoid a hugepage collapse.One way is to take page reference inside the irq disable region. Other option is to take mmap_sem so that a parallel collapse will not happen. We can also disable collapse by taking pmd_lock. Another method used by kvm subsystem is to check whether we had a mmu_notifer update in between using mmu_notifier_retry(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-13mm/debug-pagealloc: make debug-pagealloc boottime configurableJoonsoo Kim1-1/+1
Now, we have prepared to avoid using debug-pagealloc in boottime. So introduce new kernel-parameter to disable debug-pagealloc in boottime, and makes related functions to be disabled in this case. Only non-intuitive part is change of guard page functions. Because guard page is effective only if debug-pagealloc is enabled, turning off according to debug-pagealloc is reasonable thing to do. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-05powerpc/mm: don't do tlbie for updatepp request with NO HPTE faultAneesh Kumar K.V1-16/+28
upatepp can get called for a nohpte fault when we find from the linux page table that the translation was hashed before. In that case we are sure that there is no existing translation, hence we could avoid doing tlbie. We could possibly race with a parallel fault filling the TLB. But that should be ok because updatepp is only ever relaxing permissions. We also look at linux pte permission bits when filling hash pte permission bits. We also hold the linux pte busy bits while inserting/updating a hashpte entry, hence a paralle update of linux pte is not possible. On the other hand mprotect involves ptep_modify_prot_start which cause a hpte invalidate and not updatepp. Performance number: We use randbox_access_bench written by Anton. Kernel with THP disabled and smaller hash page table size. 86.60% random_access_b [kernel.kallsyms] [k] .native_hpte_updatepp 2.10% random_access_b random_access_bench [.] doit 1.99% random_access_b [kernel.kallsyms] [k] .do_raw_spin_lock 1.85% random_access_b [kernel.kallsyms] [k] .native_hpte_insert 1.26% random_access_b [kernel.kallsyms] [k] .native_flush_hash_range 1.18% random_access_b [kernel.kallsyms] [k] .__delay 0.69% random_access_b [kernel.kallsyms] [k] .native_hpte_remove 0.37% random_access_b [kernel.kallsyms] [k] .clear_user_page 0.34% random_access_b [kernel.kallsyms] [k] .__hash_page_64K 0.32% random_access_b [kernel.kallsyms] [k] fast_exception_return 0.30% random_access_b [kernel.kallsyms] [k] .hash_page_mm With Fix: 27.54% random_access_b random_access_bench [.] doit 22.90% random_access_b [kernel.kallsyms] [k] .native_hpte_insert 5.76% random_access_b [kernel.kallsyms] [k] .native_hpte_remove 5.20% random_access_b [kernel.kallsyms] [k] fast_exception_return 5.12% random_access_b [kernel.kallsyms] [k] .__hash_page_64K 4.80% random_access_b [kernel.kallsyms] [k] .hash_page_mm 3.31% random_access_b [kernel.kallsyms] [k] data_access_common 1.84% random_access_b [kernel.kallsyms] [k] .trace_hardirqs_on_caller Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-02powerpc/mm/thp: Use tlbiel if possibleAneesh Kumar K.V1-5/+23
If we know that user address space has never executed on other cpus we could use tlbiel. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-12-02powerpc/mm/thp: Remove code duplicationAneesh Kumar K.V1-0/+52
Rename invalidate_old_hpte to flush_hash_hugepage and use that in other places. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-11-03powerpc: Replace __get_cpu_var usesChristoph Lameter1-1/+1
This still has not been merged and now powerpc is the only arch that does not have this change. Sorry about missing linuxppc-dev before. V2->V2 - Fix up to work against 3.18-rc1 __get_cpu_var() is used for multiple purposes in the kernel source. One of them is address calculation via the form &__get_cpu_var(x). This calculates the address for the instance of the percpu variable of the current processor based on an offset. Other use cases are for storing and retrieving data from the current processors percpu area. __get_cpu_var() can be used as an lvalue when writing data or on the right side of an assignment. __get_cpu_var() is defined as : __get_cpu_var() always only does an address determination. However, store and retrieve operations could use a segment prefix (or global register on other platforms) to avoid the address calculation. this_cpu_write() and this_cpu_read() can directly take an offset into a percpu area and use optimized assembly code to read and write per cpu variables. This patch converts __get_cpu_var into either an explicit address calculation using this_cpu_ptr() or into a use of this_cpu operations that use the offset. Thereby address calculations are avoided and less registers are used when code is generated. At the end of the patch set all uses of __get_cpu_var have been removed so the macro is removed too. The patch set includes passes over all arches as well. Once these operations are used throughout then specialized macros can be defined in non -x86 arches as well in order to optimize per cpu access by f.e. using a global register that may be set to the per cpu base. Transformations done to __get_cpu_var() 1. Determine the address of the percpu instance of the current processor. DEFINE_PER_CPU(int, y); int *x = &__get_cpu_var(y); Converts to int *x = this_cpu_ptr(&y); 2. Same as #1 but this time an array structure is involved. DEFINE_PER_CPU(int, y[20]); int *x = __get_cpu_var(y); Converts to int *x = this_cpu_ptr(y); 3. Retrieve the content of the current processors instance of a per cpu variable. DEFINE_PER_CPU(int, y); int x = __get_cpu_var(y) Converts to int x = __this_cpu_read(y); 4. Retrieve the content of a percpu struct DEFINE_PER_CPU(struct mystruct, y); struct mystruct x = __get_cpu_var(y); Converts to memcpy(&x, this_cpu_ptr(&y), sizeof(x)); 5. Assignment to a per cpu variable DEFINE_PER_CPU(int, y) __get_cpu_var(y) = x; Converts to __this_cpu_write(y, x); 6. Increment/Decrement etc of a per cpu variable DEFINE_PER_CPU(int, y); __get_cpu_var(y)++ Converts to __this_cpu_inc(y) Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: Paul Mackerras <paulus@samba.org> Signed-off-by: Christoph Lameter <cl@linux.com> [mpe: Fix build errors caused by set/or_softirq_pending(), and rework assignment in __set_breakpoint() to use memcpy().] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-10-08powerpc/mm: Add new hash_page_mm()Ian Munsie1-7/+17
This adds a new function hash_page_mm() based on the existing hash_page(). This version allows any struct mm to be passed in, rather than assuming current. This is useful for servicing co-processor faults which are not in the context of the current running process. We need to be careful here as the current hash_page() assumes current in a few places. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-10-08powerpc/mm: Export mmu_kernel_ssize and mmu_linear_psizeIan Munsie1-0/+2
Export mmu_kernel_ssize and mmu_linear_psize. These are needed by the cxl driver which has it's own MMU. To setup the MMU cxl needs access to these. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-10-08powerpc/cell: Make spu_flush_all_slbs() genericIan Munsie1-7/+3
This moves spu_flush_all_slbs() into a generic call copro_flush_all_slbs(). This will be useful when we add cxl which also needs a similar SLB flush call. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-09-25powerpc/mm: Unindent htab_dt_scan_page_sizes()Michael Ellerman1-61/+60
We can unindent the bulk of htab_dt_scan_page_sizes() by returning early if the property is not found. That is nice in and of itself, but also has the advantage of making it clear that we always return success once we have found the ibm,segment-page-sizes property. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-09-25powerpc: Make a bunch of things staticAnton Blanchard1-1/+1
Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-09-25powerpc: Move more symbol exports next to function definitionsAnton Blanchard1-0/+1
Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-08-05powerpc: implement vmemmap_remove_mapping() for BOOK3SLi Zhong1-1/+1
This is to be called in vmemmap_free(), leave the implementation on BOOK3E empty as before. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Acked-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-07-28powerpc: Remove MMU_FTR_SLBMichael Ellerman1-4/+2
We now only support cpus that use an SLB, so we don't need an MMU feature to indicate that. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-07-28powerpc: Remove STAB codeMichael Ellerman1-14/+4
Old cpus didn't have a Segment Lookaside Buffer (SLB), instead they had a Segment Table (STAB). Now that we've dropped support for those cpus, we can remove the STAB support entirely. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-06-10Merge branch 'next' of ↵Linus Torvalds1-34/+58
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc updates from Ben Herrenschmidt: "Here is the bulk of the powerpc changes for this merge window. It got a bit delayed in part because I wasn't paying attention, and in part because I discovered I had a core PCI change without a PCI maintainer ack in it. Bjorn eventually agreed it was ok to merge it though we'll probably improve it later and I didn't want to rebase to add his ack. There is going to be a bit more next week, essentially fixes that I still want to sort through and test. The biggest item this time is the support to build the ppc64 LE kernel with our new v2 ABI. We previously supported v2 userspace but the kernel itself was a tougher nut to crack. This is now sorted mostly thanks to Anton and Rusty. We also have a fairly big series from Cedric that add support for 64-bit LE zImage boot wrapper. This was made harder by the fact that traditionally our zImage wrapper was always 32-bit, but our new LE toolchains don't really support 32-bit anymore (it's somewhat there but not really "supported") so we didn't want to rely on it. This meant more churn that just endian fixes. This brings some more LE bits as well, such as the ability to run in LE mode without a hypervisor (ie. under OPAL firmware) by doing the right OPAL call to reinitialize the CPU to take HV interrupts in the right mode and the usual pile of endian fixes. There's another series from Gavin adding EEH improvements (one day we *will* have a release with less than 20 EEH patches, I promise!). Another highlight is the support for the "Split core" functionality on P8 by Michael. This allows a P8 core to be split into "sub cores" of 4 threads which allows the subcores to run different guests under KVM (the HW still doesn't support a partition per thread). And then the usual misc bits and fixes ..." [ Further delayed by gmail deciding that BenH is a dirty spammer. Google knows. ] * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (155 commits) powerpc/powernv: Add missing include to LPC code selftests/powerpc: Test the THP bug we fixed in the previous commit powerpc/mm: Check paca psize is up to date for huge mappings powerpc/powernv: Pass buffer size to OPAL validate flash call powerpc/pseries: hcall functions are exported to modules, need _GLOBAL_TOC() powerpc: Exported functions __clear_user and copy_page use r2 so need _GLOBAL_TOC() powerpc/powernv: Set memory_block_size_bytes to 256MB powerpc: Allow ppc_md platform hook to override memory_block_size_bytes powerpc/powernv: Fix endian issues in memory error handling code powerpc/eeh: Skip eeh sysfs when eeh is disabled powerpc: 64bit sendfile is capped at 2GB powerpc/powernv: Provide debugfs access to the LPC bus via OPAL powerpc/serial: Use saner flags when creating legacy ports powerpc: Add cpu family documentation powerpc/xmon: Fix up xmon format strings powerpc/powernv: Add calls to support little endian host powerpc: Document sysfs DSCR interface powerpc: Fix regression of per-CPU DSCR setting powerpc: Split __SYSFS_SPRSETUP macro arch: powerpc/fadump: Cleaning up inconsistent NULL checks ...
2014-06-06powerpc/mm: Check paca psize is up to date for huge mappingsMichael Ellerman1-11/+20
We have a bug in our hugepage handling which exhibits as an infinite loop of hash faults. If the fault is being taken in the kernel it will typically trigger the softlockup detector, or the RCU stall detector. The bug is as follows: 1. mmap(0xa0000000, ..., MAP_FIXED | MAP_HUGE_TLB | MAP_ANONYMOUS ..) 2. Slice code converts the slice psize to 16M. 3. The code on lines 539-540 of slice.c in slice_get_unmapped_area() synchronises the mm->context with the paca->context. So the paca slice mask is updated to include the 16M slice. 3. Either: * mmap() fails because there are no huge pages available. * mmap() succeeds and the mapping is then munmapped. In both cases the slice psize remains at 16M in both the paca & mm. 4. mmap(0xa0000000, ..., MAP_FIXED | MAP_ANONYMOUS ..) 5. The slice psize is converted back to 64K. Because of the check on line 539 of slice.c we DO NOT update the paca->context. The paca slice mask is now out of sync with the mm slice mask. 6. User/kernel accesses 0xa0000000. 7. The SLB miss handler slb_allocate_realmode() **uses the paca slice mask** to create an SLB entry and inserts it in the SLB. 18. With the 16M SLB entry in place the hardware does a hash lookup, no entry is found so a data access exception is generated. 19. The data access handler calls do_page_fault() -> handle_mm_fault(). 10. __handle_mm_fault() creates a THP mapping with do_huge_pmd_anonymous_page(). 11. The hardware retries the access, there is still nothing in the hash table so once again a data access exception is generated. 12. hash_page() calls into __hash_page_thp() and inserts a mapping in the hash. Although the THP mapping maps 16M the hashing is done using 64K as the segment page size. 13. hash_page() returns immediately after calling __hash_page_thp(), skipping over the code at line 1125. Resulting in the mismatch between the paca->context and mm->context not being detected. 14. The hardware retries the access, the hash it generates using the 16M SLB entry does NOT match the hash we inserted. 15. We take another data access and go into __hash_page_thp(). 16. We see a valid entry in the hpte_slot_array and so we call updatepp() which succeeds. 17. Goto 14. We could fix this in two ways. The first would be to remove or modify the check on line 539 of slice.c. The second option is to cause the check of paca psize in hash_page() on line 1125 to also be done for THP pages. We prefer the latter, because the check & update of the paca psize is not done until we know it's necessary. It's also done only on the current cpu, so we don't need to IPI all other cpus. Without further rearranging the code, the simplest fix is to pull out the code that checks paca psize and call it in two places. Firstly for THP/hugetlb, and secondly for other mappings as before. Thanks to Dave Jones for trinity, which originally found this bug. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: stable@vger.kernel.org [v3.11+]
2014-06-04Merge tag 'devicetree-for-3.16' of ↵Linus Torvalds1-11/+11
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux into next Pull DeviceTree updates from Rob Herring: - Another round of clean-up of FDT related code in architecture code. This removes knowledge of internal FDT details from most architectures except powerpc. - Conversion of kernel's custom FDT parsing code to use libfdt. - DT based initialization for generic serial earlycon. The introduction of generic serial earlycon support went in through the tty tree. - Improve the platform device naming for DT probed devices to ensure unique naming and use parent names instead of a global index. - Fix a race condition in of_update_property. - Unify the various linker section OF match tables and fix several function prototype errors. - Update platform_get_irq_byname to work in deferred probe cases. - 2 binding doc updates * tag 'devicetree-for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (58 commits) of: handle NULL node in next_child iterators of/irq: provide more wrappers for !CONFIG_OF devicetree: bindings: Document micrel vendor prefix dt: bindings: dwc2: fix required value for the phy-names property of_pci_irq: kill useless variable in of_irq_parse_pci() of/irq: do irq resolution in platform_get_irq_byname() of: Add a testcase for of_find_node_by_path() of: Make of_find_node_by_path() handle /aliases of: Create unlocked version of for_each_child_of_node() lib: add glibc style strchrnul() variant of: Handle memory@0 node on PPC32 only pci/of: Remove dead code of: fix race between search and remove in of_update_property() of: Use NULL for pointers of: Stop naming platform_device using dcr address of: Ensure unique names without sacrificing determinism tty/serial: pl011: add DT based earlycon support of/fdt: add FDT serial scanning for earlycon of/fdt: add FDT address translation support serial: earlycon: add DT support ...
2014-05-05Merge remote-tracking branch 'anton/abiv2' into nextBenjamin Herrenschmidt1-20/+16
This series adds support for building the powerpc 64-bit LE kernel using the new ABI v2. We already supported running ABI v2 userspace programs but this adds support for building the kernel itself using the new ABI.
2014-05-01powerpc: Use 64k io pages when we never see an HEAAlexander Graf1-3/+22
When we never get around to seeing an HEA ethernet adapter, there's no point in restricting ourselves to 4k IO page size. This speeds up IO maps when CONFIG_IBMEBUS is disabled. [ Updated the test to also lift the restriction on arch 2.07 (Power 8) which cannot have an HEA -- BenH ] Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> foo
2014-04-30of/fdt: update of_get_flat_dt_prop in prep for libfdtRob Herring1-11/+11
Make of_get_flat_dt_prop arguments compatible with libfdt fdt_getprop call in preparation to convert FDT code to use libfdt. Make the return value const and the property length ptr type an int. Signed-off-by: Rob Herring <robh@kernel.org> Tested-by: Michal Simek <michal.simek@xilinx.com> Tested-by: Grant Likely <grant.likely@linaro.org> Tested-by: Stephen Chivers <schivers@csc.com>
2014-04-29KVM guest: Make pv trampoline code executableAlexander Graf1-0/+4
Our PV guest patching code assembles chunks of instructions on the fly when it encounters more complicated instructions to hijack. These instructions need to live in a section that we don't mark as non-executable, as otherwise we fault when jumping there. Right now we put it into the .bss section where it automatically gets marked as non-executable. Add a check to the NX setting function to ensure that we leave these particular pages executable. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-04-23powerpc: Fix branch patching code for ABIv2Anton Blanchard1-11/+9
The MMU hashtable and SLB branch patching code uses function pointers for the update sites. This creates a difference between ABIv1 and ABIv2 because we don't have function descriptors on ABIv2. Get rid of the function pointer and just point at the update sites directly. This works on both ABIs. Signed-off-by: Anton Blanchard <anton@samba.org>
2014-04-23powerpc: Use ppc_function_entry instead of open coding itAnton Blanchard1-10/+8
Replace FUNCTION_TEXT with ppc_function_entry which can handle both ABIv1 and ABIv2. Signed-off-by: Anton Blanchard <anton@samba.org>
2014-02-11powerpc: Fix kdump hang issue on p8 with relocation on exception enabled.Mahesh Salgaonkar1-0/+14
On p8 systems, with relocation on exception feature enabled we are seeing kdump kernel hang at interrupt vector 0xc*4400. The reason is, with this feature enabled, exception are raised with MMU (IR=DR=1) ON with the default offset of 0xc*4000. Since exception is raised in virtual mode it requires the vector region to be executable without which it fails to fetch and execute instruction at 0xc*4xxx. For default kernel since kernel is loaded at real 0, the htab mappings sets the entire kernel text region executable. But for relocatable kernel (e.g. kdump case) we only copy interrupt vectors down to real 0 and never marked that region as executable because in p7 and below we always get exception in real mode. This patch fixes this issue by marking htab mapping range as executable that overlaps with the interrupt vector region for relocatable kernel. Thanks to Ben who helped me to debug this issue and find the root cause. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-12-09powerpc/mm: Free up _PAGE_COHERENCE for numa fault use laterAneesh Kumar K.V1-3/+4
Set memory coherence always on hash64 config. If a platform cannot have memory coherence always set they can infer that from _PAGE_NO_CACHE and _PAGE_WRITETHRU like in lpar. So we dont' really need a separate bit for tracking _PAGE_COHERENCE. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-11powerpc: Book 3S MMU little endian supportAnton Blanchard1-20/+18
Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-14powerpc: Fix a number of sparse warningsAnton Blanchard1-1/+1
Address some of the trivial sparse warnings in arch/powerpc. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01powerpc: Delete __cpuinit usage from all usersPaul Gortmaker1-1/+1
The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes. After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h. This removes all the powerpc uses of the __cpuinit macros. There are no __CPUINIT users in assembly files in powerpc. [1] https://lkml.org/lkml/2013/5/20/589 Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Josh Boyer <jwboyer@gmail.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: Make linux pagetable walk safe with THP enabledAneesh Kumar K.V1-12/+15
We need to have irqs disabled to handle all the possible parallel update for linux page table without holding locks. Events that we are intersted in while walking page tables are 1) Page fault 2) umap 3) THP split 4) THP collapse A) local_irq_disabled: ------------------------ 1) page fault: A none to valid transition via page fault is not an issue because we would either see a none or valid. If it is none, we would error out the page table walk. We may need to use on stack values when checking for type of page table elements, because if we do if (!is_hugepd()) { if (!pmd_none() { if (pmd_bad() { We could take that bad condition because the pmd got converted to a hugepd after the !is_hugepd check via a hugetlb fault. The right way would be to check for pmd_none higher up or use on stack value. 2) A valid to none conversion via unmap: We can safely walk the upper level table, because we don't remove the the page table entries until rcu grace period. So even if we followed a wrong pointer we still have the pointer valid till the grace period. A PTE pointer returned need to be atomically checked for _PAGE_PRESENT and _PAGE_BUSY. A valid pointer returned could becoming none later. To prevent pte_clear we take _PAGE_BUSY. 3) THP split: A valid transparent hugepage is converted to nomal page. Before we split we do pmd_splitting_flush, which sets the hugepage PTE to _PAGE_SPLITTING So when walking page table we need to check for pmd_trans_splitting and handle that. The pte returned should also need to be checked for _PAGE_SPLITTING before setting _PAGE_BUSY similar to _PAGE_PRESENT. We save the value of PTE on stack and check for the flag in the local pte value. If we don't have the value set we can safely operate on the local pte value and we atomicaly set _PAGE_BUSY. 4) THP collapse: A normal page gets converted to hugepage. In the collapse path, we mark the pmd none early (pmdp_clear_flush). With irq disabled, if we are aleady walking page table we would see the pmd_none and won't continue. If we see a valid PMD, we should still check for _PAGE_PRESENT before setting _PAGE_BUSY, to make sure we didn't collapse the PTE to a Huge PTE. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc/THP: Add code to handle HPTE faults for hugepagesAneesh Kumar K.V1-4/+17
The deposted PTE page in the second half of the PMD table is used to track the state on hash PTEs. After updating the HPTE, we mark the coresponding slot in the deposted PTE page valid. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: Replace find_linux_pte with find_linux_pte_or_hugepteAneesh Kumar K.V1-1/+7
Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly document why we don't need to handle transparent hugepages at callsites. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc/mm: handle hugepage size correctly when invalidating hpte entriesAneesh Kumar K.V1-2/+7
If a hash bucket gets full, we "evict" a more/less random entry from it. When we do that we don't invalidate the TLB (hpte_remove) because we assume the old translation is still technically "valid". This implies that when we are invalidating or updating pte, even if HPTE entry is not valid we should do a tlb invalidate. With hugepages, we need to pass the correct actual page size value for tlb invalidation. This change update the patch 0608d692463598c1d6e826d9dd7283381b4f246c "powerpc/mm: Always invalidate tlb on hpte invalidate and update" to handle transparent hugepages correctly. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-14powerpc: Exception hooks for context tracking subsystemLi Zhong1-9/+27
This is the exception hooks for context tracking subsystem, including data access, program check, single step, instruction breakpoint, machine check, alignment, fp unavailable, altivec assist, unknown exception, whose handlers might use RCU. This patch corresponds to [PATCH] x86: Exception hooks for userspace RCU extended QS commit 6ba3c97a38803883c2eee489505796cb0a727122 But after the exception handling moved to generic code, and some changes in following two commits: 56dd9470d7c8734f055da2a6bac553caf4a468eb context_tracking: Move exception handling to generic code 6c1e0256fad84a843d915414e4b5973b7443d48d context_tracking: Restore correct previous context state on exception exit it is able for exception hooks to use the generic code above instead of a redundant arch implementation. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-06powerpc/tm: Fix null pointer deference in flush_hash_pageMichael Neuling1-0/+1
Make sure that current->thread.reg exists before we deference it in flush_hash_page. Signed-off-by: Michael Neuling <mikey@neuling.org> Reported-by: John J Miller <millerjo@us.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Print page size info during bootAneesh Kumar K.V1-5/+5
This gives hint about different base and actual page size combination supported by the platform. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: print both base and actual page size on hash failureAneesh Kumar K.V1-5/+7
Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Decode the pte-lp-encoding bits correctly.Aneesh Kumar K.V1-44/+81
We look at both the segment base page size and actual page size and store the pte-lp-encodings in an array per base page size. We also update all relevant functions to take actual page size argument so that we can use the correct PTE LP encoding in HPTE. This should also get the basic Multiple Page Size per Segment (MPSS) support. This is needed to enable THP on ppc64. [Fixed PR KVM build --BenH] Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-18powerpc: Try to insert the hptes repeatedly in kernel_map_linear_page()Li Zhong1-6/+7
This patch fixes the following oops, which could be trigged by build the kernel with many concurrent threads, under CONFIG_DEBUG_PAGEALLOC. hpte_insert() might return -1, indicating that the bucket (primary here) is full. We are not necessarily reporting a BUG in this case. Instead, we could try repeatedly (try secondary, remove and try again) until we find a slot. [ 543.075675] ------------[ cut here ]------------ [ 543.075701] kernel BUG at arch/powerpc/mm/hash_utils_64.c:1239! [ 543.075714] Oops: Exception in kernel mode, sig: 5 [#1] [ 543.075722] PREEMPT SMP NR_CPUS=16 DEBUG_PAGEALLOC NUMA pSeries [ 543.075741] Modules linked in: binfmt_misc ehea [ 543.075759] NIP: c000000000036eb0 LR: c000000000036ea4 CTR: c00000000005a594 [ 543.075771] REGS: c0000000a90832c0 TRAP: 0700 Not tainted (3.8.0-next-20130222) [ 543.075781] MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 22224482 XER: 00000000 [ 543.075816] SOFTE: 0 [ 543.075823] CFAR: c00000000004c200 [ 543.075830] TASK = c0000000e506b750[23934] 'cc1' THREAD: c0000000a9080000 CPU: 1 GPR00: 0000000000000001 c0000000a9083540 c000000000c600a8 ffffffffffffffff GPR04: 0000000000000050 fffffffffffffffa c0000000a90834e0 00000000004ff594 GPR08: 0000000000000001 0000000000000000 000000009592d4d8 c000000000c86854 GPR12: 0000000000000002 c000000006ead300 0000000000a51000 0000000000000001 GPR16: f000000003354380 ffffffffffffffff ffffffffffffff80 0000000000000000 GPR20: 0000000000000001 c000000000c600a8 0000000000000001 0000000000000001 GPR24: 0000000003354380 c000000000000000 0000000000000000 c000000000b65950 GPR28: 0000002000000000 00000000000cd50e 0000000000bf50d9 c000000000c7c230 [ 543.076005] NIP [c000000000036eb0] .kernel_map_pages+0x1e0/0x3f8 [ 543.076016] LR [c000000000036ea4] .kernel_map_pages+0x1d4/0x3f8 [ 543.076025] Call Trace: [ 543.076033] [c0000000a9083540] [c000000000036ea4] .kernel_map_pages+0x1d4/0x3f8 (unreliable) [ 543.076053] [c0000000a9083640] [c000000000167638] .get_page_from_freelist+0x6cc/0x8dc [ 543.076067] [c0000000a9083800] [c000000000167a48] .__alloc_pages_nodemask+0x200/0x96c [ 543.076082] [c0000000a90839c0] [c0000000001ade44] .alloc_pages_vma+0x160/0x1e4 [ 543.076098] [c0000000a9083a80] [c00000000018ce04] .handle_pte_fault+0x1b0/0x7e8 [ 543.076113] [c0000000a9083b50] [c00000000018d5a8] .handle_mm_fault+0x16c/0x1a0 [ 543.076129] [c0000000a9083c00] [c0000000007bf1dc] .do_page_fault+0x4d0/0x7a4 [ 543.076144] [c0000000a9083e30] [c0000000000090e8] handle_page_fault+0x10/0x30 [ 543.076155] Instruction dump: [ 543.076163] 7c630038 78631d88 e80a0000 f8410028 7c0903a6 e91f01de e96a0010 e84a0008 [ 543.076192] 4e800421 e8410028 7c7107b4 7a200fe0 <0b000000> 7f63db78 48785781 60000000 [ 543.076224] ---[ end trace bd5807e8d6ae186b ]--- Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
2013-04-18powerpc: Split the code trying to insert hpte repeatedly as an helper functionLi Zhong1-0/+35
Move the logic trying to insert hpte in __hash_page_huge() to an helper function, so it could also be used by others. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
2013-03-17powerpc: Update kernel VSID rangeAneesh Kumar K.V1-5/+15
This patch change the kernel VSID range so that we limit VSID_BITS to 37. This enables us to support 64TB with 65 bit VA (37+28). Without this patch we have boot hangs on platforms that only support 65 bit VA. With this patch we now have proto vsid generated as below: We first generate a 37-bit "proto-VSID". Proto-VSIDs are generated from mmu context id and effective segment id of the address. For user processes max context id is limited to ((1ul << 19) - 5) for kernel space, we use the top 4 context ids to map address as below 0x7fffc - [ 0xc000000000000000 - 0xc0003fffffffffff ] 0x7fffd - [ 0xd000000000000000 - 0xd0003fffffffffff ] 0x7fffe - [ 0xe000000000000000 - 0xe0003fffffffffff ] 0x7ffff - [ 0xf000000000000000 - 0xf0003fffffffffff ] Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Tested-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: <stable@vger.kernel.org> [v3.8]
2013-03-13powerpc: Fix STAB initializationBenjamin Herrenschmidt1-0/+2
Commit f5339277eb8d3aed37f12a27988366f68ab68930 accidentally removed more than just iSeries bits and took out the call to stab_initialize() thus breaking support for POWER3 processors. Put it back. (Yes, nobody noticed until now ...) Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: <stable@vger.kernel.org> [v3.4+]
2013-02-15powerpc: Hook in new transactional memory codeMichael Neuling1-0/+16
This hooks the new transactional memory code into context switching, FP/VMX/VMX unavailable and exception return. Signed-off-by: Matt Evans <matt@ozlabs.org> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-17powerpc/mm: Increase the slice range to 64TBAneesh Kumar K.V1-6/+9
This patch makes the high psizes mask as an unsigned char array so that we can have more than 16TB. Currently we support upto 64TB Reviewed-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-17powerpc/mm: Convert virtual address to vpnAneesh Kumar K.V1-15/+15
This patch convert different functions to take virtual page number instead of virtual address. Virtual page number is virtual address shifted right by VPN_SHIFT (12) bits. This enable us to have an address range of upto 76 bits. Reviewed-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-05powerpc/mm: Replace abs_to_virt() with __va()Michael Ellerman1-2/+1
abs_to_virt() is just a wrapper around __va(), call __va() directly. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-28Disintegrate asm/system.h for PowerPCDavid Howells1-1/+0
Disintegrate asm/system.h for PowerPC. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> cc: linuxppc-dev@lists.ozlabs.org
2012-03-21powerpc: Remove FW_FEATURE ISERIES from arch codeStephen Rothwell1-6/+3
This is no longer selectable, so just remove all the dependent code. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-02-23fadump: Register for firmware assisted dump.Mahesh Salgaonkar1-0/+11
On 2012-02-20 11:02:51 Mon, Paul Mackerras wrote: > On Thu, Feb 16, 2012 at 04:44:30PM +0530, Mahesh J Salgaonkar wrote: > > If I have read the code correctly, we are going to get this printk on > non-pSeries machines or on older pSeries machines, even if the user > has not put the fadump=on option on the kernel command line. The > printk will be annoying since there is no actual error condition. It > seems to me that the condition for the printk should include > fw_dump.fadump_enabled. In other words you should probably add > > if (!fw_dump.fadump_enabled) > return 0; > > at the beginning of the function. Hi Paul, Thanks for pointing it out. Please find the updated patch below. The existing patches above this (4/10 through 10/10) cleanly applies on this update. Thanks, -Mahesh. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-06Merge branch 'modsplit-Oct31_2011' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h
2011-10-31powerpc: add export.h to files making use of EXPORT_SYMBOLPaul Gortmaker1-0/+1
With module.h being implicitly everywhere via device.h, the absence of explicitly including something for EXPORT_SYMBOL went unnoticed. Since we are heading to fix things up and clean module.h from the device.h file, we need to explicitly include these files now. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-09-20powerpc: Fix oops when echoing bad values to /sys/devices/system/memory/probeAnton Blanchard1-3/+3
If we echo an address the hypervisor doesn't like to /sys/devices/system/memory/probe we oops the box: # echo 0x10000000000 > /sys/devices/system/memory/probe kernel BUG at arch/powerpc/mm/hash_utils_64.c:541! The backtrace is: create_section_mapping arch_add_memory add_memory memory_probe_store sysdev_class_store sysfs_write_file vfs_write SyS_write In create_section_mapping we BUG if htab_bolt_mapping returned an error. A better approach is to return an error which will propagate back to userspace. Rerunning the test with this patch applied: # echo 0x10000000000 > /sys/devices/system/memory/probe -bash: echo: write error: Invalid argument Signed-off-by: Anton Blanchard <anton@samba.org> Cc: stable@kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-09-20powerpc: Hugetlb for BookEBecky Bruce1-3/+0
Enable hugepages on Freescale BookE processors. This allows the kernel to use huge TLB entries to map pages, which can greatly reduce the number of TLB misses and the amount of TLB thrashing experienced by applications with large memory footprints. Care should be taken when using this on FSL processors, as the number of large TLB entries supported by the core is low (16-64) on current processors. The supported set of hugepage sizes include 4m, 16m, 64m, 256m, and 1g. Page sizes larger than the max zone size are called "gigantic" pages and must be allocated on the command line (and cannot be deallocated). This is currently only fully implemented for Freescale 32-bit BookE processors, but there is some infrastructure in the code for 64-bit BooKE. Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-04-27powerpc: Free up some CPU feature bits by moving out MMU-related featuresMatt Evans1-9/+9
Some of the 64bit PPC CPU features are MMU-related, so this patch moves them to MMU_FTR_ bits. All cpu_has_feature()-style tests are moved to mmu_has_feature(), and seven feature bits are freed as a result. Signed-off-by: Matt Evans <matt@ozlabs.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-04-20powerpc: Replace open coded instruction patching with ↵Anton Blanchard1-17/+27
patch_instruction/patch_branch There are a few places we patch instructions without using patch_instruction and patch_branch, probably because they predated it. Fix it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-03-31Fix common misspellingsLucas De Marchi1-1/+1
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2010-11-29powerpc/mm: Avoid avoidable void* pointerMichael Neuling1-1/+1
Change pgdir from a void to real type. Having this as a void is stupid and has already caused 1 bug. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-11-18powerpc: Fix call to subpage_protection()Michael Neuling1-1/+1
In: powerpc/mm: Fix pgtable cache cleanup with CONFIG_PPC_SUBPAGE_PROT commit d28513bc7f675d28b479db666d572e078ecf182d Author: David Gibson <david@gibson.dropbear.id.au> subpage_protection() was changed to to take an mm rather a pgdir but it didn't change calling site in hashpage_preload(). The change wasn't noticed at compile time since hashpage_preload() used a void* as the parameter to subpage_protection(). This is obviously wrong and can trigger the following crash when CONFIG_SLAB, CONFIG_DEBUG_SLAB, CONFIG_PPC_64K_PAGES CONFIG_PPC_SUBPAGE_PROT are enabled. Freeing unused kernel memory: 704k freed Unable to handle kernel paging request for data at address 0x6b6b6b6b6b6c49b7 Faulting instruction address: 0xc0000000000410f4 cpu 0x2: Vector: 300 (Data Access) at [c00000004233f590] pc: c0000000000410f4: .hash_preload+0x258/0x338 lr: c000000000041054: .hash_preload+0x1b8/0x338 sp: c00000004233f810 msr: 8000000000009032 dar: 6b6b6b6b6b6c49b7 dsisr: 40000000 current = 0xc00000007e2c0070 paca = 0xc000000007fe0500 pid = 1, comm = init enter ? for help [c00000004233f810] c000000000041020 .hash_preload+0x184/0x338 (unreliable) [c00000004233f8f0] c00000000003ed98 .update_mmu_cache+0xb0/0xd0 [c00000004233f990] c000000000157754 .__do_fault+0x48c/0x5dc [c00000004233faa0] c000000000158fd0 .handle_mm_fault+0x508/0xa8c [c00000004233fb90] c0000000006acdd4 .do_page_fault+0x428/0x6ac [c00000004233fe30] c000000000005260 handle_page_fault+0x20/0x74 Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-08-05memblock: Remove rmo_size, burry it in arch/powerpc where it belongsBenjamin Herrenschmidt1-1/+21
The RMA (RMO is a misnomer) is a concept specific to ppc64 (in fact server ppc64 though I hijack it on embedded ppc64 for similar purposes) and represents the area of memory that can be accessed in real mode (aka with MMU off), or on embedded, from the exception vectors (which is bolted in the TLB) which pretty much boils down to the same thing. We take that out of the generic MEMBLOCK data structure and move it into arch/powerpc where it belongs, renaming it to "RMA" while at it. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-08-05memblock: Introduce default allocation limit and use it to replace explicit onesBenjamin Herrenschmidt1-1/+2
This introduce memblock.current_limit which is used to limit allocations from memblock_alloc() or memblock_alloc_base(..., MEMBLOCK_ALLOC_ACCESSIBLE). The old MEMBLOCK_ALLOC_ANYWHERE changes value from 0 to ~(u64)0 and can still be used with memblock_alloc_base() to allocate really anywhere. It is -no-longer- cropped to MEMBLOCK_REAL_LIMIT which disappears. Note to archs: I'm leaving the default limit to MEMBLOCK_ALLOC_ANYWHERE. I strongly recommend that you ensure that you set an appropriate limit during boot in order to guarantee that an memblock_alloc() at any time results in something that is accessible with a simple __va(). The reason is that a subsequent patch will introduce the ability for the array to resize itself by reallocating itself. The MEMBLOCK core will honor the current limit when performing those allocations. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-08-05memblock: Expose MEMBLOCK_ALLOC_ANYWHEREBenjamin Herrenschmidt1-1/+1
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-08-04memblock/powerpc: Use new accessorsBenjamin Herrenschmidt1-4/+4
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-08-04memblock: Rename memblock_region to memblock_type and memblock_property to ↵Benjamin Herrenschmidt1-1/+1
memblock_region Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-07-23powerpc/mm: Add some debug output when hash insertion failsBenjamin Herrenschmidt1-5/+29
This adds some debug output to our MMU hash code to print out some useful debug data if the hypervisor refuses the insertion (which should normally never happen). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> ---
2010-07-23powerpc/mm: Move around testing of _PAGE_PRESENT in hash codeBenjamin Herrenschmidt1-8/+11
Instead of adding _PAGE_PRESENT to the access permission mask in each low level routine independently, we add it once from hash_page(). We also move the preliminary access check (the racy one before the PTE is locked) up so it applies to the huge page case. This duplicates code in __hash_page_huge() which we'll remove in a subsequent patch to fix a race in there. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-07-14lmb: rename to memblockYinghai Lu1-13/+13
via following scripts FILES=$(find * -type f | grep -vE 'oprofile|[^K]config') sed -i \ -e 's/lmb/memblock/g' \ -e 's/LMB/MEMBLOCK/g' \ $FILES for N in $(find . -name lmb.[ch]); do M=$(echo $N | sed 's/lmb/memblock/g') mv $N $M done and remove some wrong change like lmbench and dlmb etc. also move memblock.c from lib/ to mm/ Suggested-by: Ingo Molnar <mingo@elte.hu> Acked-by: "H. Peter Anvin" <hpa@zytor.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-12-18powerpc/mm: Fix stupid bug in subpge protection handlingDavid Gibson1-2/+2
Commit d28513bc7f675d28b479db666d572e078ecf182d ("Fix bug in pagetable cache cleanup with CONFIG_PPC_SUBPAGE_PROT"), itself a fix for breakage caused by an earlier clean up patch of mine, contains a stupid bug. I changed the parameters of the subpage_protection() function, but failed to update one of the callers. This patch fixes it, and replaces a void * with a typed pointer so that the compiler will warn on such an error in future. Signed-off-by: David Gibson <dwg@au1.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-12-18powerpc/mm: Fix hash_utils_64.c compile errors with DEBUG enabled.Sachin P. Sant1-4/+4
This time without the funny characters. Fix following build errors generated with DEBUG=1 cc1: warnings being treated as errors arch/powerpc/mm/hash_utils_64.c: In function 'htab_dt_scan_page_sizes': arch/powerpc/mm/hash_utils_64.c:343: error: format '%04x' expects type 'unsigned int', but argument 4 has type 'long unsigned int' arch/powerpc/mm/hash_utils_64.c:343: error: format '%08x' expects type 'unsigned int', but argument 5 has type 'long unsigned int' arch/powerpc/mm/hash_utils_64.c: In function 'htab_initialize': arch/powerpc/mm/hash_utils_64.c:666: error: format '%x' expects type 'unsigned int', but argument 4 has type 'long unsigned int' ... SNIP ... Signed-off-by: Sachin Sant <sachinp@in.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-12-08powerpc/mm: Fix pgtable cache cleanup with CONFIG_PPC_SUBPAGE_PROTDavid Gibson1-3/+3
Commit a0668cdc154e54bf0c85182e0535eea237d53146 cleans up the handling of kmem_caches for allocating various levels of pagetables. Unfortunately, it conflicts badly with CONFIG_PPC_SUBPAGE_PROT, due to the latter's cleverly hidden technique of adding some extra allocation space to the top level page directory to store the extra information it needs. Since that extra allocation really doesn't fit into the cleaned up page directory allocating scheme, this patch alters CONFIG_PPC_SUBPAGE_PROT to instead allocate its struct subpage_prot_table as part of the mm_context_t. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-12-02Revert "powerpc/mm: Fix bug in pagetable cache cleanup with ↵Benjamin Herrenschmidt1-3/+3
CONFIG_PPC_SUBPAGE_PROT" This reverts commit c045256d146800ea1d741a8e9e377dada6b7e195. It breaks build when CONFIG_PPC_SUBPAGE_PROT is not set. I will commit a fixed version separately Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-11-27powerpc/mm: Fix bug in pagetable cache cleanup with CONFIG_PPC_SUBPAGE_PROTDavid Gibson1-3/+3
Commit a0668cdc154e54bf0c85182e0535eea237d53146 cleans up the handling of kmem_caches for allocating various levels of pagetables. Unfortunately, it conflicts badly with CONFIG_PPC_SUBPAGE_PROT, due to the latter's cleverly hidden technique of adding some extra allocation space to the top level page directory to store the extra information it needs. Since that extra allocation really doesn't fit into the cleaned up page directory allocating scheme, this patch alters CONFIG_PPC_SUBPAGE_PROT to instead allocate its struct subpage_prot_table as part of the mm_context_t. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-11-05Export symbols for KVM moduleAlexander Graf1-0/+2
We want to be able to build KVM as a module. To enable us doing so, we need some more exports from core Linux parts. This patch exports all functions and variables that are required for KVM. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>