aboutsummaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2005-06-23[PATCH] remove redundant vm_flags clearing from madvise.cPekka Enberg1-1/+0
This patch removes redundant VM_ClearReadHint from mm/madvice.c which was left there by Prasanna's patch. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] create a kstrdup library functionPaulo Marques1-0/+24
This patch creates a new kstrdup library function and changes the "local" implementations in several places to use this function. Most of the changes come from the sound and net subsystems. The sound part had already been acknowledged by Takashi Iwai and the net part by David S. Miller. I left UML alone for now because I would need more time to read the code carefully before making changes there. Signed-off-by: Paulo Marques <pmarques@grupopie.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] NUMA aware block device control structure allocationChristoph Lameter1-5/+12
Patch to allocate the control structures for for ide devices on the node of the device itself (for NUMA systems). The patch depends on the Slab API change patch by Manfred and me (in mm) and the pcidev_to_node patch that I posted today. Does some realignment too. Signed-off-by: Justin M. Forbes <jmforbes@linuxtx.org> Signed-off-by: Christoph Lameter <christoph@lameter.com> Signed-off-by: Pravin Shelar <pravin@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem hotplug baseAndy Whitcroft2-22/+74
Make sparse's initalization be accessible at runtime. This allows sparse mappings to be created after boot in a hotplug situation. This patch is separated from the previous one just to give an indication how much of the sparse infrastructure is *just* for hotplug memory. The section_mem_map doesn't really store a pointer. It stores something that is convenient to do some math against to get a pointer. It isn't valid to just do *section_mem_map, so I don't think it should be stored as a pointer. There are a couple of things I'd like to store about a section. First of all, the fact that it is !NULL does not mean that it is present. There could be such a combination where section_mem_map *is* NULL, but the math gets you properly to a real mem_map. So, I don't think that check is safe. Since we're storing 32-bit-aligned structures, we have a few bits in the bottom of the pointer to play with. Use one bit to encode whether there's really a mem_map there, and the other one to tell whether there's a valid section there. We need to distinguish between the two because sometimes there's a gap between when a section is discovered to be present and when we can get the mem_map for it. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Jack Steiner <steiner@sgi.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem swiss cheese numa layoutsAndy Whitcroft1-0/+2
The part of the sparsemem patch which modifies memmap_init_zone() has recently become a problem. It changes behavior so that there is a call to pfn_to_page() for each individual page inside of a node's range: node_start_pfn through node_end_pfn. It used to simply do this once, at the beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside of a node made it necessary to change. Mike Kravetz recently wrote a patch which made the NUMA code accept some new kinds of layouts. The system's memory was laid out like this, with node 0's memory in two pieces: one before and one after node 1's memory: Node 0: +++++ +++++ Node 1: +++++ Previous behavior before Mike's patch was to assign nodes like this: Node 0: 00000 XXXXX Node 1: 11111 Where the 'X' areas were simply thrown away. The new behavior was to make the pg_data_t span node 0 across all of its areas, including areas that are really node 1's: Node 0: 000000000000000 Node 1: 11111 This wastes a little bit of mem_map space, but ends up being OK, and more fully utilizes the system's memory. memmap_init_zone() initializes all of the "struct page"s for node 0, even for the "hole", but those never get used, because there is no pfn_to_page() that resolves to those pages. However, only calling pfn_to_page() once, memmap_init_zone() always uses the pages that were allocated for node0->node_mem_map because: struct page *start = pfn_to_page(start_pfn); // effectively start = &node->node_mem_map[0] for (page = start; page < (start + size); page++) { init_page_here();... page++; } Slow, and wasteful, but generally harmless. But, modify that to call pfn_to_page() for each loop iteration (like sparsemem does): for (pfn = start_pfn; pfn < < (start_pfn + size); pfn++++) { page = pfn_to_page(pfn); } And you end up trying to initialize node 1's pages too early, along with bogus data from node 0. This patch checks for those weird layouts and declines to touch the pages, making the more frequent pfn_to_page() calls OK to do. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem memory modelAndy Whitcroft6-15/+159
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of mem_map[] is needed by discontiguous memory machines (like in the old CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually become a complete replacement. A significant advantage over DISCONTIGMEM is that it's completely separated from CONFIG_NUMA. When producing this patch, it became apparent in that NUMA and DISCONTIG are often confused. Another advantage is that sparse doesn't require each NUMA node's ranges to be contiguous. It can handle overlapping ranges between nodes with no problems, where DISCONTIGMEM currently throws away that memory. Sparsemem uses an array to provide different pfn_to_page() translations for each SECTION_SIZE area of physical memory. This is what allows the mem_map[] to be chopped up. In order to do quick pfn_to_page() operations, the section number of the page is encoded in page->flags. Part of the sparsemem infrastructure enables sharing of these bits more dynamically (at compile-time) between the page_zone() and sparsemem operations. However, on 32-bit architectures, the number of bits is quite limited, and may require growing the size of the page->flags type in certain conditions. Several things might force this to occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of memory), an increase in the physical address space, or an increase in the number of used page->flags. One thing to note is that, once sparsemem is present, the NUMA node information no longer needs to be stored in the page->flags. It might provide speed increases on certain platforms and will be stored there if there is room. But, if out of room, an alternate (theoretically slower) mechanism is used. This patch introduces CONFIG_FLATMEM. It is used in almost all cases where there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM often have to compile out the same areas of code. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Martin Bligh <mbligh@aracnet.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] generify memory presentAndy Whitcroft1-0/+4
Allow architectures to indicate that they will be providing hooks to indice installed memory areas, memory_present(). Provide prototypes for the i386 implementation. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Martin Bligh <mbligh@aracnet.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] mm/Kconfig: give DISCONTIG more help textDave Hansen1-0/+10
This gives DISCONTIGMEM a bit more help text to explain what it does, not just when to choose it. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] mm/Kconfig: hide "Memory Model" selection menuDave Hansen1-4/+17
I got some feedback from users who think that the new "Memory Model" menu is a little invasive. This patch will hide that menu, except when CONFIG_EXPERIMENTAL is enabled *or* when an individual architecture wants it. An individual arch may want to enable it because they've removed their arch-specific DISCONTIG prompt in favor of the mm/Kconfig one. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem: fix minor "defaults" issue in mm/KconfigDave Hansen1-2/+1
The following patch applies on top of 2.6.12-rc2-mm1. It fixes a minor user interaction issue, and an early reference to SPARSEMEM. This "choice" menu would always default to FLATMEM, as it was listed first. Move it to the end so that the other defaults have a chance first. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] Introduce new Kconfig option for NUMA or DISCONTIGDave Hansen2-3/+11
There is some confusion that arose when working on SPARSEMEM patch between what is needed for DISCONTIG vs. NUMA. Multiple pg_data_t's are needed for DISCONTIGMEM or NUMA, independently. All of the current NUMA implementations require an implementation of DISCONTIG. Because of this, quite a lot of code which is really needed for NUMA is actually under DISCONTIG #ifdefs. For SPARSEMEM, we changed some of these #ifdefs to CONFIG_NUMA, but that broke the DISCONTIG=y and NUMA=n case. Introducing this new NEED_MULTIPLE_NODES config option allows code that is needed for both NUMA or DISCONTIG to be separated out from code that is specific to DISCONTIG. One great advantage of this approach is that it doesn't require every architecture to be converted over. All of the current implementations should "just work", only the ones implementing SPARSEMEM will have to be fixed up. The change to free_area_init() makes it work inside, or out of the new config option. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] create mm/Kconfig for arch-independent memory optionsDave Hansen1-0/+25
With sparsemem being introduced, we need a central place for new memory-related .config options: mm/Kconfig. This allows us to remove many of the duplicated arch-specific options. The new option, CONFIG_FLATMEM, is there to enable us to detangle NUMA and DISCONTIGMEM. This is a requirement for sparsemem because sparsemem uses the NUMA code without the presence of DISCONTIGMEM. The sparsemem patches use CONFIG_FLATMEM in generic code, so this patch is a requirement before applying them. Almost all places that used to do '#ifndef CONFIG_DISCONTIGMEM' should use '#ifdef CONFIG_FLATMEM' instead. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem base: reorganize page->flags bit operationsDave Hansen1-1/+1
Generify the value fields in the page_flags. The aim is to allow the location and size of these fields to be varied. Additionally we want to move away from fixed allocations per field whilst still enforcing the overall bit utilisation limits. We rely on the compiler to spot and optimise the accessor functions. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem base: simple NUMA remap space allocatorDave Hansen1-1/+5
Introduce a simple allocator for the NUMA remap space. This space is very scarce, used for structures which are best allocated node local. This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep the pgdat->node_mem_map initialized in a consistent place for all architectures. Issues: o alloc_remap takes a node_id where we might expect a pgdat which was intended to allow us to allocate the pgdat's using this mechanism; which we do not yet do. Could have alloc_remap_node() and alloc_remap_nid() for this purpose. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-22[PATCH] boot_pageset must not be freed.Christoph Lameter1-2/+9
The boot_pageset needs to be preserved for hotplugging and for off line processors and nodes. Otherwise pointers will point into memory that has now a different use. /proc/zoneinfo is currently showing strange results if processors / nodes are not present. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] Kill stray newlineDenis Vlasenko1-1/+1
OOM killer prints a stray newline. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] msync: check pte dirty earlierAbhijit Karmarkar1-0/+2
It's common practice to msync a large address range regularly, in which often only a few ptes have actually been dirtied since the previous pass. sync_pte_range then goes much faster if it tests whether pte is dirty before locating and accessing each struct page cacheline; and it is hardly slowed by ptep_clear_flush_dirty repeating that test in the opposite case, when every pte actually is dirty. But beware, s390's pte_dirty always says false, since its dirty bit is kept in the storage key, located via the struct page address. So skip this optimization in its case: use a pte_maybe_dirty macro which just says true if page_test_and_clear_dirty is implemented. Signed-off-by: Abhijit Karmarkar <abhijitk@veritas.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] can_share_swap_page: use page_mapcountHugh Dickins3-65/+21
Remember that ironic get_user_pages race? when the raised page_count on a page swapped out led do_wp_page to decide that it had to copy on write, so substituted a different page into userspace. 2.6.7 onwards have Andrea's solution, where try_to_unmap_one backs out if it finds page_count raised. Which works, but is unsatisfying (rmap.c has no other page_count heuristics), and was found a few months ago to hang an intensive page migration test. A year ago I was hesitant to engage page_mapcount, now it seems the right fix. So remove the page_count hack from try_to_unmap_one; and use activate_page in unuse_mm when dropping lock, to replace its secondary effect of helping swapoff to make progress in that case. Simplify can_share_swap_page (now called only on anonymous pages) to check page_mapcount + page_swapcount == 1: still needs the page lock to stabilize their (pessimistic) sum, but does not need swapper_space.tree_lock for that. In do_swap_page, move swap_free and unlock_page below page_add_anon_rmap, to keep sum on the high side, and correct when can_share_swap_page called. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] do_wp_page: cannot share file pageHugh Dickins1-1/+1
A small optimization to do_wp_page's check for whether to avoid copy by reusing the page already mapped. It can never share a cached file page, nor can it share a reserved page (often the empty zero page), so it's a waste of time to lock and unlock in those cases. Which nowadays can both be neatly excluded by a preliminary PageAnon test. Christoph has reported that a preliminary page_count test proved valuable for scalability here, but PageAnon covers more common cases all at once. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] get_user_pages: kill get_page_mapHugh Dickins1-35/+10
Since its birth, get_user_pages has been calling a misguided get_page_map function. follow_page has already returned NULL if the pfn is invalid, we cannot reach an invalid pfn from a validated struct page. Remove get_page_map, and the messy rewind in get_user_pages to cope with its failure. Oh, and could we please call that "struct page *page" like everywhere else, instead of "struct page *map"? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] bad_page: clear reclaim and slabHugh Dickins1-5/+10
Since free_pages_check complains if PG_reclaim or PG_slab is set, bad_page ought to clear them to avoid repetitive reports (Nikita noticed this too). Let prep_new_page check page_count and PG_slab as free_pages_check does. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] mbind: check_range use standard ptwalkHugh Dickins1-45/+70
Strict mbind's check for currently mapped pages being on node has been using a slow loop which re-evaluates pgd, pud, pmd, pte for each entry: replace that by a standard four-level page table walk like others in mm. Since mmap_sem is held for writing, page_table_lock can be taken at the inner level to limit latency. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] mbind: fix verify_pages pte_pageHugh Dickins1-5/+14
Strict mbind's check that pages already mapped are on right node has been using pte_page without checking if pfn_valid, and without page_table_lock to prevent spurious failures when try_to_unmap_one intervenes between the pte_present and the pte_page. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] shmem: restore superblock infoHugh Dickins1-73/+70
To improve shmem scalability, we allowed tmpfs instances which don't need their blocks or inodes limited not to count them, and not to allocate any sbinfo. Which was okay when the only use for the sbinfo was accounting blocks and inodes; but since then a couple of unrelated projects extending tmpfs want to store other data in the sbinfo. Whether either extension reaches mainline is beside the point: I'm guilty of a bad design decision, and should restore sbinfo to make any such future extensions easier. So, once again allocate a shmem_sb_info for every shmem/tmpfs instance, and now let max_blocks 0 indicate unlimited blocks, and max_inodes 0 unlimited inodes. Brent Casavant verified (many months ago) that this does not perceptibly impact the scalability (since the unlimited sbinfo cacheline is repeatedly accessed but only once dirtied). And merge shmem_set_size into its sole caller shmem_remount_fs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] Reduce size of huge boot per_cpu_pagesetChristoph Lameter1-66/+42
Reduce size of the huge per_cpu_pageset structure in __initdata introduced into mm1 with the pageset localization patchset. Use one specially configured pageset per cpu for all zones and nodes during bootup. - Avoid duplication of pageset initialization code. - do the adding to the pageset list before potential free_pages_bulk in free_hot_cold_page (otherwise we would have to hold a page in a pageset during the period that the boot pagesets are in use). - remove mistaken __cpuinitdata attribute and revert back to __initdata for the boot pageset. A boot pageset is not necessary for cpu hotplug. Tested for UP SMP NUMA on x86_64 (2.6.12-rc6-mm1): UP SMP NUMA Tested on IA64 (2.6.12-rc5-mm2): NUMA (2.6.12-rc6-mm1 broken for IA64 because of sparsemem patches) Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] Periodically drain non local pagesetsChristoph Lameter2-2/+34
The pageset array can potentially acquire a huge amount of memory on large NUMA systems. F.e. on a system with 512 processors and 256 nodes there will be 256*512 pagesets. If each pageset only holds 5 pages then we are talking about 655360 pages.With a 16K page size on IA64 this results in potentially 10 Gigabytes of memory being trapped in pagesets. The typical cases are much less for smaller systems but there is still the potential of memory being trapped in off node pagesets. Off node memory may be rarely used if local memory is available and so we may potentially have memory in seldom used pagesets without this patch. The slab allocator flushes its per cpu caches every 2 seconds. The following patch flushes the off node pageset caches in the same way by tying into the slab flush. The patch also changes /proc/zoneinfo to include the number of pages currently in each pageset. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] add OOM debugJanet Morgan2-3/+5
This patch provides more debug info when the system is OOM. It displays memory stats (basically sysrq-m info) from __alloc_pages() when page allocation fails and during OOM kill. Thanks to Dave Jones for coming up with the idea. Signed-off-by: Janet Morgan <janetmor@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] __read_page_state(): pass unsigned long instead of unsignedBenjamin LaHaise1-1/+1
By making the offset argument of __read_page_state an unsigned long instead of unsigned, we can avoid forcing the compiler to sign extend a usually constant argument. This saves 1 instruction on x86-64. Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] __mod_page_state(): pass unsigned long instead of unsignedBenjamin LaHaise1-1/+1
By making the offset argument of __mod_page_state an unsigned long instead of unsigned, we can avoid forcing the compiler to sign extend a usually constant argument. This saves 1 instruction on x86-64. Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] vm: try_to_free_pages unused argumentDarren Hart2-3/+2
try_to_free_pages accepts a third argument, order, but hasn't used it since before 2.6.0. The following patch removes the argument and updates all the calls to try_to_free_pages. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] mmap topdown fix for large stack limit, large allocationChris Wright1-0/+4
The topdown changes in 2.6.12-rc1 can cause large allocations with large stack limit to fail, despite there being space available. The mmap_base-len is only valid when len >= mmap_base. However, nothing in topdown allocator checks this. It's only (now) caught at higher level, which will cause allocation to simply fail. The following change restores the fallback to bottom-up path, which will allow large allocations with large stack limit to potentially still succeed. Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] Avoiding mmap fragmentationWolfgang Wander2-14/+41
Ingo recently introduced a great speedup for allocating new mmaps using the free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and causes huge performance increases in thread creation. The downside of this patch is that it does lead to fragmentation in the mmap-ed areas (visible via /proc/self/maps), such that some applications that work fine under 2.4 kernels quickly run out of memory on any 2.6 kernel. The problem is twofold: 1) the free_area_cache is used to continue a search for memory where the last search ended. Before the change new areas were always searched from the base address on. So now new small areas are cluttering holes of all sizes throughout the whole mmap-able region whereas before small holes tended to close holes near the base leaving holes far from the base large and available for larger requests. 2) the free_area_cache also is set to the location of the last munmap-ed area so in scenarios where we allocate e.g. five regions of 1K each, then free regions 4 2 3 in this order the next request for 1K will be placed in the position of the old region 3, whereas before we appended it to the still active region 1, placing it at the location of the old region 2. Before we had 1 free region of 2K, now we only get two free regions of 1K -> fragmentation. The patch addresses thes issues by introducing yet another cache descriptor cached_hole_size that contains the largest known hole size below the current free_area_cache. If a new request comes in the size is compared against the cached_hole_size and if the request can be filled with a hole below free_area_cache the search is started from the base instead. The results look promising: Whereas 2.6.12-rc4 fragments quickly and my (earlier posted) leakme.c test program terminates after 50000+ iterations with 96 distinct and fragmented maps in /proc/self/maps it performs nicely (as expected) with thread creation, Ingo's test_str02 with 20000 threads requires 0.7s system time. Taking out Ingo's patch (un-patch available per request) by basically deleting all mentions of free_area_cache from the kernel and starting the search for new memory always at the respective bases we observe: leakme terminates successfully with 11 distinctive hardly fragmented areas in /proc/self/maps but thread creating is gringdingly slow: 30+s(!) system time for Ingo's test_str02 with 20000 threads. Now - drumroll ;-) the appended patch works fine with leakme: it ends with only 7 distinct areas in /proc/self/maps and also thread creation seems sufficiently fast with 0.71s for 20000 threads. Signed-off-by: Wolfgang Wander <wwc@rentec.com> Credit-to: "Richard Purdie" <rpurdie@rpsys.net> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> (partly) Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] node local per-cpu-pagesChristoph Lameter2-36/+177
This patch modifies the way pagesets in struct zone are managed. Each zone has a per-cpu array of pagesets. So any particular CPU has some memory in each zone structure which belongs to itself. Even if that CPU is not local to that zone. So the patch relocates the pagesets for each cpu to the node that is nearest to the cpu instead of allocating the pagesets in the (possibly remote) target zone. This means that the operations to manage pages on remote zone can be done with information available locally. We play a macro trick so that non-NUMA pmachines avoid the additional pointer chase on the page allocator fastpath. AIM7 benchmark on a 32 CPU SGI Altix w/o patches: Tasks jobs/min jti jobs/min/task real cpu 1 484.68 100 484.6769 12.01 1.97 Fri Mar 25 11:01:42 2005 100 27140.46 89 271.4046 21.44 148.71 Fri Mar 25 11:02:04 2005 200 30792.02 82 153.9601 37.80 296.72 Fri Mar 25 11:02:42 2005 300 32209.27 81 107.3642 54.21 451.34 Fri Mar 25 11:03:37 2005 400 34962.83 78 87.4071 66.59 588.97 Fri Mar 25 11:04:44 2005 500 31676.92 75 63.3538 91.87 742.71 Fri Mar 25 11:06:16 2005 600 36032.69 73 60.0545 96.91 885.44 Fri Mar 25 11:07:54 2005 700 35540.43 77 50.7720 114.63 1024.28 Fri Mar 25 11:09:49 2005 800 33906.70 74 42.3834 137.32 1181.65 Fri Mar 25 11:12:06 2005 900 34120.67 73 37.9119 153.51 1325.26 Fri Mar 25 11:14:41 2005 1000 34802.37 74 34.8024 167.23 1465.26 Fri Mar 25 11:17:28 2005 with slab API changes and pageset patch: Tasks jobs/min jti jobs/min/task real cpu 1 485.00 100 485.0000 12.00 1.96 Fri Mar 25 11:46:18 2005 100 28000.96 89 280.0096 20.79 150.45 Fri Mar 25 11:46:39 2005 200 32285.80 79 161.4290 36.05 293.37 Fri Mar 25 11:47:16 2005 300 40424.15 84 134.7472 43.19 438.42 Fri Mar 25 11:47:59 2005 400 39155.01 79 97.8875 59.46 590.05 Fri Mar 25 11:48:59 2005 500 37881.25 82 75.7625 76.82 730.19 Fri Mar 25 11:50:16 2005 600 39083.14 78 65.1386 89.35 872.79 Fri Mar 25 11:51:46 2005 700 38627.83 77 55.1826 105.47 1022.46 Fri Mar 25 11:53:32 2005 800 39631.94 78 49.5399 117.48 1169.94 Fri Mar 25 11:55:30 2005 900 36903.70 79 41.0041 141.94 1310.78 Fri Mar 25 11:57:53 2005 1000 36201.23 77 36.2012 160.77 1458.31 Fri Mar 25 12:00:34 2005 Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] Hugepage consolidationDavid Gibson1-1/+176
A lot of the code in arch/*/mm/hugetlbpage.c is quite similar. This patch attempts to consolidate a lot of the code across the arch's, putting the combined version in mm/hugetlb.c. There are a couple of uglyish hacks in order to covert all the hugepage archs, but the result is a very large reduction in the total amount of code. It also means things like hugepage lazy allocation could be implemented in one place, instead of six. Tested, at least a little, on ppc64, i386 and x86_64. Notes: - this patch changes the meaning of set_huge_pte() to be more analagous to set_pte() - does SH4 need s special huge_ptep_get_and_clear()?? Acked-by: William Lee Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] VM: rate limit early reclaimMartin Hicks2-0/+11
When early zone reclaim is turned on the LRU is scanned more frequently when a zone is low on memory. This limits when the zone reclaim can be called by skipping the scan if another thread (either via kswapd or sync reclaim) is already reclaiming from the zone. Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] VM: add __GFP_NORECLAIMMartin Hicks1-0/+2
When using the early zone reclaim, it was noticed that allocating new pages that should be spread across the whole system caused eviction of local pages. This adds a new GFP flag to prevent early reclaim from happening during certain allocation attempts. The example that is implemented here is for page cache pages. We want page cache pages to be spread across the whole system, and we don't want page cache pages to evict other pages to get local memory. Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] VM: early zone reclaimMartin Hicks2-5/+92
This is the core of the (much simplified) early reclaim. The goal of this patch is to reclaim some easily-freed pages from a zone before falling back onto another zone. One of the major uses of this is NUMA machines. With the default allocator behavior the allocator would look for memory in another zone, which might be off-node, before trying to reclaim from the current zone. This adds a zone tuneable to enable early zone reclaim. It is selected on a per-zone basis and is turned on/off via syscall. Adding some extra throttling on the reclaim was also required (patch 4/4). Without the machine would grind to a crawl when doing a "make -j" kernel build. Even with this patch the System Time is higher on average, but it seems tolerable. Here are some numbers for kernbench runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run: wall user sys %cpu ctx sw. sleeps ---- ---- --- ---- ------ ------ No patch 1009 1384 847 258 298170 504402 w/patch, no reclaim 880 1376 667 288 254064 396745 w/patch & reclaim 1079 1385 926 252 291625 548873 These numbers are the average of 2 runs of 3 "make -j" runs done right after system boot. Run-to-run variability for "make -j" is huge, so these numbers aren't terribly useful except to seee that with reclaim the benchmark still finishes in a reasonable amount of time. I also looked at the NUMA hit/miss stats for the "make -j" runs and the reclaim doesn't make any difference when the machine is thrashing away. Doing a "make -j8" on a single node that is filled with page cache pages takes 700 seconds with reclaim turned on and 735 seconds without reclaim (due to remote memory accesses). The simple zone_reclaim syscall program is at http://www.bork.org/~mort/sgi/zone_reclaim.c Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] VM: add may_swap flag to scan_controlMartin Hicks1-1/+6
Here's the next round of these patches. These are totally different in an attempt to meet the "simpler" request after the last patches. For reference the earlier threads are: http://marc.theaimsgroup.com/?l=linux-kernel&m=110839604924587&w=2 http://marc.theaimsgroup.com/?l=linux-mm&m=111461480721249&w=2 This set of patches replaces my other vm- patches that are currently in -mm. So they're against 2.6.12-rc5-mm1 about half way through the -mm patchset. As I said already this patch is a lot simpler. The reclaim is turned on or off on a per-zone basis using a syscall. I haven't tested the x86 syscall, so it might be wrong. It uses the existing reclaim/pageout code with the small addition of a may_swap flag to scan_control (patch 1/4). I also added __GFP_NORECLAIM (patch 3/4) so that certain allocation types can be flagged to never cause reclaim. This was a deficiency that was in all of my earlier patch sets. Previously, doing a big buffered read would fill one zone with page cache and then start to reclaim from that same zone, leaving the other zones untouched. Adding some extra throttling on the reclaim was also required (patch 4/4). Without the machine would grind to a crawl when doing a "make -j" kernel build. Even with this patch the System Time is higher on average, but it seems tolerable. Here are some numbers for kernbench runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run: wall user sys %cpu ctx sw. sleeps ---- ---- --- ---- ------ ------ No patch 1009 1384 847 258 298170 504402 w/patch, no reclaim 880 1376 667 288 254064 396745 w/patch & reclaim 1079 1385 926 252 291625 548873 These numbers are the average of 2 runs of 3 "make -j" runs done right after system boot. Run-to-run variability for "make -j" is huge, so these numbers aren't terribly useful except to seee that with reclaim the benchmark still finishes in a reasonable amount of time. I also looked at the NUMA hit/miss stats for the "make -j" runs and the reclaim doesn't make any difference when the machine is thrashing away. Doing a "make -j8" on a single node that is filled with page cache pages takes 700 seconds with reclaim turned on and 735 seconds without reclaim (due to remote memory accesses). The simple zone_reclaim syscall program is at http://www.bork.org/~mort/sgi/zone_reclaim.c This patch: This adds an extra switch to the scan_control struct. It simply lets the reclaim code know if its allowed to swap pages out. This was required for a simple per-zone reclaimer. Without this addition pages would be swapped out as soon as a zone ran out of memory and the early reclaim kicked in. Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] mm: add /proc/zoneinfoNikita Danilov1-2/+111
Add /proc/zoneinfo file to display information about memory zones. Useful to analyze VM behaviour. Signed-off-by: Nikita Danilov <nikita@clusterfs.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] madvise: merge the mapsPrasanna Meda1-29/+51
This attempts to merge back the split maps. This code is mostly copied from Chrisw's mlock merging from post 2.6.11 trees. The only difference is in munmapped_error handling. Also passed prev to willneed/dontneed, eventhogh they do not handle it now, since I felt it will be cleaner, instead of handling prev in madvise_vma in some cases and in subfunction in some cases. Signed-off-by: Prasanna Meda <pmeda@akamai.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] madvise: do not split the mapsPrasanna Meda1-11/+16
This attempts to avoid splittings when it is not needed, that is when vm_flags are same as new flags. The idea is from the <2.6.11 mlock_fixup and others. This will provide base for the next madvise merging patch. Signed-off-by: Prasanna Meda <pmeda@akamai.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] vmscan: notice slab shrinkingakpm@osdl.org1-5/+14
Fix a problem identified by Andrea Arcangeli <andrea@suse.de> kswapd will set a zone into all_unreclaimable state if it sees that we're not successfully reclaiming LRU pages. But that fails to notice that we're successfully reclaiming slab obects, so we can set all_unreclaimable too soon. So change shrink_slab() to return a success indication if it actually reclaimed some objects, and don't assume that the zone is all_unreclaimable if that is true. This means that we won't enter all_unreclaimable state if we are successfully freeing slab objects but we're not yet actually freeing slab pages, due to internal fragmentation. (hm, this has a shortcoming. We could be successfully freeing ZONE_NORMAL slab objects while being really oom on ZONE_DMA. If that happens then kswapd might burn a lot of CPU. But given that there might be some slab objects in ZONE_DMA, perhaps that is appropriate.) Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-18[SLAB] Introduce kmem_cache_nameArnaldo Carvalho de Melo1-0/+6
This is for use with slab users that pass a dynamically allocated slab name in kmem_cache_create, so that before destroying the slab one can retrieve the name and free its memory. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-06[PATCH] broken fault_in_pages_readable call in generic_file_buffered_write()Martin Schwidefsky1-1/+7
fault_in_pages_readable() is being passed an incorrect `end' address, which can result in writes accidentally faulting in pages which will not be affected by the write() call. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-24[PATCH] try_to_unmap_cluster() passes out-of-bounds pte to pte_unmap()William Lee Irwin III1-3/+3
try_to_unmap_cluster() does: for (pte = pte_offset_map(pmd, address); address < end; pte++, address += PAGE_SIZE) { ... } pte_unmap(pte); It may take a little staring to notice, but pte can actually fall off the end of the pte page in this iteration, which makes life difficult for kmap_atomic() and the users not expecting it to BUG(). Of course, we're somewhat lucky in that arithmetic elsewhere in the function guarantees that at least one iteration is made, lest this force larger rearrangements to be made. This issue and patch also apply to non-mm mainline and with trivial adjustments, at least two related kernels. Discovered during internal testing at Oracle. Signed-off-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-21[PATCH] fix for __generic_file_aio_read() to return 0 on EOFSuparna Bhattacharya1-1/+1
I came across the following problem while running ltp-aiodio testcases from ltp-full-20050405 on linux-2.6.12-rc3-mm3. I tried running the tests with EXT3 as well as JFS filesystems. One or two fsx-linux testcases were hung after some time. These testcases were hanging at wait_for_all_aios(). Debugging shows that there were some iocbs which were not getting completed eventhough the last retry for those returned -EIOCBQUEUED. Also all such pending iocbs represented READ operation. Further debugging revealed that all such iocbs hit EOF in the DIO layer. To be more precise, the "pos" from which they were trying to read was greater than the "size" of the file. So the generic_file_direct_IO returned 0. This happens rarely as there is already a check in __generic_file_aio_read(), for whether "pos" < "size" before calling direct IO routine. >size = i_size_read(inode); >if (pos < size) { > retval = generic_file_direct_IO(READ, iocb, > iov, pos, nr_segs); But for READ, we are taking the inode->i_sem only in the DIO layer. So it is possible that some other process can change the size of the file before we take the i_sem. In such a case ( when "pos" > "size"), the __generic_file_aio_read() would return -EIOCBQUEUED even though there were no I/O requests submitted by the DIO layer. This would cause the AIO layer to expect aio_complete() for THE iocb, which doesnot happen. And thus the test hangs forever, waiting for an I/O completion, where there are no requests submitted at all. The following patch makes __generic_file_aio_read() return 0 (instead of returning -EIOCBQUEUED), on getting 0 from generic_file_direct_IO(), so that the AIO layer does the aio_complete(). Testing: I have tested the patch on a SMP machine(with 2 Pentium 4 (HT)) running linux-2.6.12-rc3-mm3. I ran the ltp-aiodio testcases and none of the fsx-linux tests hung. Also the aio-stress tests ran without any problem. Signed-off-by: Suzuki K P <suzuki@in.ibm.com> Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-20[PATCH] x86_64: Fixed guard page handling again in iounmapAndi Kleen1-13/+20
Caused oopses again. Also fix potential mismatch in checking if change_page_attr was needed. To do it without races I needed to change mm/vmalloc.c to export a __remove_vm_area that does not take vmlist lock. Noticed by Terence Ripperda and based on a patch of his. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-19Fix get_unmapped_area sanity testsLinus Torvalds1-28/+31
As noted by Chris Wright, we need to do the full range of tests regardless of whether MAP_FIXED is set or not, so re-organize get_unmapped_area() slightly to do the sanity checks unconditionally.
2005-05-19[PATCH] prevent NULL mmap in topdown modelLinus Torvalds1-2/+2
Prevent the topdown allocator from allocating mmap areas all the way down to address zero. We still allow a MAP_FIXED mapping of page 0 (needed for various things, ranging from Wine and DOSEMU to people who want to allow speculative loads off a NULL pointer). Tested by Chris Wright. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-17[PATCH] do_swap_page() can map random data if swap read failsKirill Korotaev1-5/+12
There is a bug in do_swap_page(): when swap page happens to be unreadable, page filled with random data is mapped into user address space. The fix is to check for PageUptodate and send SIGBUS in case of error. Signed-Off-By: Kirill Korotaev <dev@sw.ru> Signed-Off-By: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-17[PATCH] swapout oops fixMcMullan, Jason1-1/+1
Fix OOPS when swapping on a device that doesn't have an unplug_io_fn defined (eg, ATA Over Ethernet) Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-17[PATCH] mm/nommu.c: try to fix __vmallocAdrian Bunk1-1/+2
Linus changed the second argument of __vmalloc from int to unsigned int breaking the compilation for CONFIG_MMU=n configurations (since he only changed vmalloc.c but not nommu.c). Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-17[PATCH] mm acct accounting fixKirill Korotaev1-1/+6
This patch fixes mm->total_vm and mm->locked_vm acctounting in case when move_page_tables() fails inside move_vma(). Signed-Off-By: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-17[PATCH] mm: fix rss counter being incremented when unmappingBjorn Steinbrink1-1/+1
This patch fixes a bug introduced by the "mm counter operations through macros" patch, which replaced a decrement operation in with an increment macro in try_to_unmap_one(). Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-05[PATCH] remove outdated comments from filemap.cChristoph Hellwig1-5/+0
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-03[IA64] Export node_online_map and node_possible_mapDean Nelson1-0/+2
Export node_online_map and node_possible_map so that kernel modules can use the nodemask macros, like, for_each_node() and for_each_online_node(). Signed-off-by: Dean Nelson <dcn@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-05-01[PATCH] DocBook: fix some descriptionsMartin Waitz3-13/+14
Some KernelDoc descriptions are updated to match the current code. No code changes. Signed-off-by: Martin Waitz <tali@admingilde.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] DocBook: changes and extensions to the kernel documentationPavel Pisa2-6/+5
I have recompiled Linux kernel 2.6.11.5 documentation for me and our university students again. The documentation could be extended for more sources which are equipped by structured comments for recent 2.6 kernels. I have tried to proceed with that task. I have done that more times from 2.6.0 time and it gets boring to do same changes again and again. Linux kernel compiles after changes for i386 and ARM targets. I have added references to some more files into kernel-api book, I have added some section names as well. So please, check that changes do not break something and that categories are not too much skewed. I have changed kernel-doc to accept "fastcall" and "asmlinkage" words reserved by kernel convention. Most of the other changes are modifications in the comments to make kernel-doc happy, accept some parameters description and do not bail out on errors. Changed <pid> to @pid in the description, moved some #ifdef before comments to correct function to comments bindings, etc. You can see result of the modified documentation build at http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz Some more sources are ready to be included into kernel-doc generated documentation. Sources has been added into kernel-api for now. Some more section names added and probably some more chaos introduced as result of quick cleanup work. Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> Signed-off-by: Martin Waitz <tali@admingilde.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] Change synchronize_kernel to _rcu and _schedPaul E. McKenney1-1/+1
This patch changes calls to synchronize_kernel(), deprecated in the earlier "Deprecate synchronize_kernel, GPL replacement" patch to instead call the new synchronize_rcu() and synchronize_sched() APIs. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] Exterminate PAGE_BUGMatt Mackall1-2/+1
Remove PAGE_BUG - repalce it with BUG and BUG_ON. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] use smp_mb/wmb/rmb where possibleakpm@osdl.org1-2/+2
Replace a number of memory barriers with smp_ variants. This means we won't take the unnecessary hit on UP machines. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] add kmalloc_node, inline cleanupManfred Spraul1-14/+31
The patch makes the following function calls available to allocate memory on a specific node without changing the basic operation of the slab allocator: kmem_cache_alloc_node(kmem_cache_t *cachep, unsigned int flags, int node); kmalloc_node(size_t size, unsigned int flags, int node); in a similar way to the existing node-blind functions: kmem_cache_alloc(kmem_cache_t *cachep, unsigned int flags); kmalloc(size, flags); kmem_cache_alloc_node was changed to pass flags and the node information through the existing layers of the slab allocator (which lead to some minor rearrangements). The functions at the lowest layer (kmem_getpages, cache_grow) are already node aware. Also __alloc_percpu can call kmalloc_node now. Performance measurements (using the pageset localization patch) yields: w/o patches: Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.97 Wed Mar 30 20:50:43 2005 100 25170.83 91 251.7083 23.12 150.10 Wed Mar 30 20:51:06 2005 200 34601.66 84 173.0083 33.64 294.14 Wed Mar 30 20:51:40 2005 300 37154.47 86 123.8482 46.99 436.56 Wed Mar 30 20:52:28 2005 400 39839.82 80 99.5995 58.43 580.46 Wed Mar 30 20:53:27 2005 500 40036.32 79 80.0726 72.68 728.60 Wed Mar 30 20:54:40 2005 600 44074.21 79 73.4570 79.23 872.10 Wed Mar 30 20:55:59 2005 700 44016.60 78 62.8809 92.56 1015.84 Wed Mar 30 20:57:32 2005 800 40411.05 80 50.5138 115.22 1161.13 Wed Mar 30 20:59:28 2005 900 42298.56 79 46.9984 123.83 1303.42 Wed Mar 30 21:01:33 2005 1000 40955.05 80 40.9551 142.11 1441.92 Wed Mar 30 21:03:55 2005 with pageset localization and slab API patches: Tasks jobs/min jti jobs/min/task real cpu 1 484.19 100 484.1930 12.02 1.98 Wed Mar 30 21:10:18 2005 100 27428.25 92 274.2825 21.22 149.79 Wed Mar 30 21:10:40 2005 200 37228.94 86 186.1447 31.27 293.49 Wed Mar 30 21:11:12 2005 300 41725.42 85 139.0847 41.84 434.10 Wed Mar 30 21:11:54 2005 400 43032.22 82 107.5805 54.10 582.06 Wed Mar 30 21:12:48 2005 500 42211.23 83 84.4225 68.94 722.61 Wed Mar 30 21:13:58 2005 600 40084.49 82 66.8075 87.12 873.11 Wed Mar 30 21:15:25 2005 700 44169.30 79 63.0990 92.24 1008.77 Wed Mar 30 21:16:58 2005 800 43097.94 79 53.8724 108.03 1155.88 Wed Mar 30 21:18:47 2005 900 41846.75 79 46.4964 125.17 1303.38 Wed Mar 30 21:20:52 2005 1000 40247.85 79 40.2478 144.60 1442.21 Wed Mar 30 21:23:17 2005 Signed-off-by: Christoph Lameter <christoph@lameter.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] sync_page() smp_mb() commentWilliam Lee Irwin III1-1/+19
The smp_mb() is becaus sync_page() doesn't have PG_locked while it accesses page_mapping(page). The comments in the patch (the entire patch is the addition of this comment) try to explain further how and why smp_mb() is used. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] RLIMIT_MEMLOCK checking fixChris Wright1-4/+6
Always use page counts when doing RLIMIT_MEMLOCK checking to avoid possible overflow. Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] count bounce buffer pages in vmstatKAMEZAWA Hiroyuki2-0/+3
This is a patch for counting the number of pages for bounce buffers. It's shown in /proc/vmstat. Currently, the number of bounce pages are not counted anywhere. So, if there are many bounce pages, it seems that there are leaked pages. And it's difficult for a user to imagine the usage of bounce pages. So, it's meaningful to show # of bouce pages. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] mm: use __GFP_NOMEMALLOCNick Piggin1-19/+8
Use the new __GFP_NOMEMALLOC to simplify the previous handling of PF_MEMALLOC. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] mempool: simplify allocNick Piggin1-21/+9
Mempool is pretty clever. Looks too clever for its own good :) It shouldn't really know so much about page reclaim internals. - don't guess about what effective page reclaim might involve. - don't randomly flush out all dirty data if some unlikely thing happens (alloc returns NULL). page reclaim can (sort of :P) handle it. I think the main motivation is trying to avoid pool->lock at all costs. However the first allocation is attempted with __GFP_WAIT cleared, so it will be 'can_try_harder' if it hits the page allocator. So if allocation still fails, then we can probably afford to hit the pool->lock - and what's the alternative? Try page reclaim and hit zone->lru_lock? A nice upshot is that we don't need to do any fancy memory barriers or do (intentionally) racy access to pool-> fields outside the lock. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] mempool: NOMEMALLOC and NORETRYNick Piggin2-10/+19
Mempools have 2 problems. The first is that mempool_alloc can possibly get stuck in __alloc_pages when they should opt to fail, and take an element from their reserved pool. The second is that it will happily eat emergency PF_MEMALLOC reserves instead of going to their reserved pools. Fix the first by passing __GFP_NORETRY in the allocation calls in mempool_alloc. Fix the second by introducing a __GFP_MEMPOOL flag which directs the page allocator not to allocate from the reserve pool. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] mm: pcp use non powers of 2 for batch sizeNick Piggin1-0/+12
Jack Steiner reported this to have fixed his problem (bad colouring): "The patches fix both problems that I found - bad coloring & excessive pages in pagesets." In most workloads this is not likely to be such a pronounced problem, however it should help corner cases. And avoiding powers of 2 in these types of memory operations is always a good idea. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] mm: rmap.c cleanupNikita Danilov1-63/+50
mm/rmap.c:page_referenced_one() and mm/rmap.c:try_to_unmap_one() contain identical code that - takes mm->page_table_lock; - drills through page tables; - checks that correct pte is reached. Coalesce this into page_check_address() Signed-off-by: Nikita Danilov <nikita@clusterfs.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] RLIMIT_AS checking fixakpm@osdl.org2-8/+22
Address bug #4508: there's potential for wraparound in the various places where we perform RLIMIT_AS checking. (I'm a bit worried about acct_stack_growth(). Are we sure that vma->vm_mm is always equal to current->mm? If not, then we're comparing some other process's total_vm with the calling process's rlimits). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01[PATCH] generic_file_buffered_write fixesakpm@osdl.org1-2/+4
Anton Altaparmakov <aia21@cam.ac.uk> points out: - It calls fault_in_pages_readable() which is completely bogus if @nr_segs > 1. It needs to be replaced by a to be written "fault_in_pages_readable_iovec()". - It increments @buf even in the iovec case thus @buf can point to random memory really quickly (in the iovec case) and then it calls fault_in_pages_readable() on this random memory. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-24[PATCH] mempolicy.c GFP fixAl Viro1-1/+1
zonelist_policy() forgot to mask non-zone bits from gfp when comparing zone number with policy_zone. ACKed-by: Andi Kleen <ak@suse.de> Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-19[PATCH] freepgt: remove FIRST_USER_ADDRESS hackHugh Dickins1-5/+0
Once all the MMU architectures define FIRST_USER_ADDRESS, remove hack from mmap.c which derived it from FIRST_USER_PGD_NR. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-19[PATCH] freepgt: sys_mincore ignore FIRST_USER_PGD_NRHugh Dickins1-3/+0
Remove use of FIRST_USER_PGD_NR from sys_mincore: it's inconsistent (no other syscall refers to it), unnecessary (sys_mincore loops over vmas further down) and incorrect (misses user addresses in ARM's first pgd). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-19[PATCH] freepgt: free_pgtables from FIRST_USER_ADDRESSHugh Dickins1-3/+8
The patches to free_pgtables by vma left problems on any architectures which leave some user address page table entries unencapsulated by vma. Andi has fixed the 32-bit vDSO on x86_64 to use a vma. Now fix arm (and arm26), whose first PAGE_SIZE is reserved (perhaps) for machine vectors. Our calls to free_pgtables must not touch that area, and exit_mmap's BUG_ON(nr_ptes) must allow that arm's get_pgd_slow may (or may not) have allocated an extra page table, which its free_pgd_slow would free later. FIRST_USER_PGD_NR has misled me and others: until all the arches define FIRST_USER_ADDRESS instead, a hack in mmap.c to derive one from t'other. This patch fixes the bugs, the remaining patches just clean it up. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-19[PATCH] freepgt: mpnt to vma cleanupHugh Dickins1-18/+17
While dabbling here in mmap.c, clean up mysterious "mpnt"s to "vma"s. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-19[PATCH] freepgt: hugetlb_free_pgd_rangeHugh Dickins1-13/+23
ia64 and ppc64 had hugetlb_free_pgtables functions which were no longer being called, and it wasn't obvious what to do about them. The ppc64 case turns out to be easy: the associated tables are noted elsewhere and freed later, safe to either skip its hugetlb areas or go through the motions of freeing nothing. Since ia64 does need a special case, restore to ppc64 the special case of skipping them. The ia64 hugetlb case has been broken since pgd_addr_end went in, though it probably appeared to work okay if you just had one such area; in fact it's been broken much longer if you consider a long munmap spanning from another region into the hugetlb region. In the ia64 hugetlb region, more virtual address bits are available than in the other regions, yet the page tables are structured the same way: the page at the bottom is larger. Here we need to scale down each addr before passing it to the standard free_pgd_range. Was about to write a hugely_scaled_down macro, but found htlbpage_to_page already exists for just this purpose. Fixed off-by-one in ia64 is_hugepage_only_range. Uninline free_pgd_range to make it available to ia64. Make sure the vma-gathering loop in free_pgtables cannot join a hugepage_only_range to any other (safe to join huges? probably but don't bother). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-19[PATCH] freepgt: remove MM_VM_SIZE(mm)Hugh Dickins2-18/+16
There's only one usage of MM_VM_SIZE(mm) left, and it's a troublesome macro because mm doesn't contain the (32-bit emulation?) info needed. But it too is only needed because we ignore the end from the vma list. We could make flush_pgtables return that end, or unmap_vmas. Choose the latter, since it's a natural fit with unmap_mapping_range_vma needing to know its restart addr. This does make more than minimal change, but if unmap_vmas had returned the end before, this is how we'd have done it, rather than storing the break_addr in zap_details. unmap_vmas used to return count of vmas scanned, but that's just debug which hasn't been useful in a while; and if we want the map_count 0 on exit check back, it can easily come from the final remove_vm_struct loop. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-19[PATCH] freepgt: free_pgtables use vma listHugh Dickins2-118/+136
Recent woes with some arches needing their own pgd_addr_end macro; and 4-level clear_page_range regression since 2.6.10's clear_page_tables; and its long-standing well-known inefficiency in searching throughout the higher-level page tables for those few entries to clear and free: all can be blamed on ignoring the list of vmas when we free page tables. Replace exit_mmap's clear_page_range of the total user address space by free_pgtables operating on the mm's vma list; unmap_region use it in the same way, giving floor and ceiling beyond which it may not free tables. This brings lmbench fork/exec/sh numbers back to 2.6.10 (unless preempt is enabled, in which case latency fixes spoil unmap_vmas throughput). Beware: the do_mmap_pgoff driver failure case must now use unmap_region instead of zap_page_range, since a page table might have been allocated, and can only be freed while it is touched by some vma. Move free_pgtables from mmap.c to memory.c, where its lower levels are adapted from the clear_page_range levels. (Most of free_pgtables' old code was actually for a non-existent case, prev not properly set up, dating from before hch gave us split_vma.) Pass mmu_gather** in the public interfaces, since we might want to add latency lockdrops later; but no attempt to do so yet, going by vma should itself reduce latency. But what if is_hugepage_only_range? Those ia64 and ppc64 cases need careful examination: put that off until a later patch of the series. What of x86_64's 32bit vdso page __map_syscall32 maps outside any vma? And the range to sparc64's flush_tlb_pgtables? It's less clear to me now that we need to do more than is done here - every PMD_SIZE ever occupied will be flushed, do we really have to flush every PGDIR_SIZE ever partially occupied? A shame to complicate it unnecessarily. Special thanks to David Miller for time spent repairing my ceilings. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16[PATCH] vmscan: pageout(): remove unneeded testakpm@osdl.org1-1/+1
) We only call pageout() for dirty pages, so this test is redundant. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16[PATCH] oom-killer disable for iscsi/lvm2/multipath userland critical sectionsAndrea Arcangeli1-1/+1
iscsi/lvm2/multipath needs guaranteed protection from the oom-killer, so make the magical value of -17 in /proc/<pid>/oom_adj defeat the oom-killer altogether. (akpm: we still need to document oom_adj and friends in Documentation/filesystems/proc.txt!) Signed-off-by: Andrea Arcangeli <andrea@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16[PATCH] filemap_getpage can block when MAP_NONBLOCK specifiedJeff Moyer1-1/+6
We will return NULL from filemap_getpage when a page does not exist in the page cache and MAP_NONBLOCK is specified, here: page = find_get_page(mapping, pgoff); if (!page) { if (nonblock) return NULL; goto no_cached_page; } But we forget to do so when the page in the cache is not uptodate. The following could result in a blocking call: /* * Ok, found a page in the page cache, now we need to check * that it's up-to-date. */ if (!PageUptodate(page)) goto page_not_uptodate; Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16Linux-2.6.12-rc2v2.6.12-rc2Linus Torvalds37-0/+28187
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!