The page fault handler for anonymous pages generates significant overhead
beyond its essential function, which is to clear a page and set up a new page
table entry for a never-accessed memory location. This overhead grows
significantly in an SMP environment.
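
As a point of reference (not part of the patch), the per-page cost can be seen
from user space: each first touch of a page in an anonymous mapping takes one
minor fault. A minimal sketch, using only standard mmap() and getrusage():

/*
 * Minimal user-space illustration (not part of the patch): every first
 * touch of a page in an anonymous mapping takes one minor fault, which
 * is the per-page overhead that prefaulting tries to amortize.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	size_t len = 4096 * (size_t) psize;	/* 4096 pages of anonymous memory */
	struct rusage before, after;
	size_t off;
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	getrusage(RUSAGE_SELF, &before);
	for (off = 0; off < len; off += psize)
		p[off] = 1;			/* first write to each page faults */
	getrusage(RUSAGE_SELF, &after);

	printf("minor faults for %zu pages: %ld\n",
	       len / psize, after.ru_minflt - before.ru_minflt);
	munmap(p, len);
	return 0;
}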

The page fault scalability patches addressed this issue by changing the
locking scheme so that fault handlers can run concurrently on multiple CPUs.
This patch instead attempts to aggregate multiple page faults into a single
one. It does that by recognizing anonymous page faults generated in sequence
by an application.

If a fault occurs for page x and is then followed by a fault for page x+1,
it is reasonable to expect another page fault at x+2 in the near future. If
the page table entries for both x+1 and x+2 are prepared while handling the
fault for x+1, the overhead of taking a separate fault for x+2 is avoided.
However, page x+2 may never be used, and thus we may have increased the rss
of the application unnecessarily. The swapper will take care of removing such
a page if memory gets tight.

The following patch makes the anonymous fault handler anticipate future
faults. For each fault a prediction is made of where the next fault will
occur (assuming linear access by the application). If the prediction turns
out to be right (the next fault is where it was expected), a number of pages
is preallocated in order to avoid a series of future faults. The preallocation
order increases by one (doubling the number of pages) for each successful
prediction in sequence.

The first successful prediction leads to one additional page being allocated,
the second to 2 additional pages, the third to 4 pages, and so on. The maximum
order is 3 by default, so for a large contiguous allocation the number of
faults is reduced by a factor of 8.
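
To make the bookkeeping concrete, here is a small user-space sketch of the
prediction logic described above (an assumed simplification, not the patch
itself: the real code lives in do_anonymous_page() below, the fields mirror
the new anon_fault_next_addr/anon_fault_order members of task_struct, and the
clamping of the window to the vma and pmd boundaries is omitted):

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define MAX_FAULT_ORDER	3	/* default maximum preallocation order */

struct fault_state {
	unsigned long anon_fault_next_addr;	/* predicted address of next fault */
	int anon_fault_order;			/* order used on the last fault */
};

/* Returns the number of pages that would be set up for a fault at addr. */
static int handle_anon_fault(struct fault_state *s, unsigned long addr)
{
	addr &= ~(PAGE_SIZE - 1);

	if (s->anon_fault_next_addr == addr) {
		/* Prediction was right: raise the order and preallocate. */
		int order = s->anon_fault_order + 1;

		if (order > MAX_FAULT_ORDER)
			order = MAX_FAULT_ORDER;
		s->anon_fault_order = order;
		s->anon_fault_next_addr = addr + (PAGE_SIZE << order);
		return 1 << order;
	}

	/* Miss (or first fault): predict the next page, no preallocation. */
	s->anon_fault_next_addr = addr + PAGE_SIZE;
	s->anon_fault_order = 0;
	return 1;
}

int main(void)
{
	struct fault_state s = { 0, 0 };
	unsigned long base = 0x100000;
	int pages = 0, faults = 0;

	/* Linear walk over 64 pages: a fault occurs only at the first page
	 * that was not prepared by an earlier fault. */
	while (pages < 64) {
		pages += handle_anon_fault(&s, base + pages * PAGE_SIZE);
		faults++;
	}
	printf("64 pages touched with %d faults instead of 64\n", faults);
	return 0;
}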

The patch may be combined with the page fault scalability patch (a separate
edition of this patch will be needed for that; it will be forthcoming after
the page fault scalability patch has been included). The combined patches
triple the achievable page fault rate from ~1 million faults/sec to ~3 million
faults/sec.

Standard kernel on a 512 CPU machine allocating 32GB with an increasing
number of threads (and thus increasing parallelism of page faults):

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 32   3    1    1.416s    138.165s 139.050s 45073.831  45097.498
 32   3    2    1.397s    148.523s  78.044s 41965.149  80201.646
 32   3    4    1.390s    152.618s  44.044s 40851.258 141545.239
 32   3    8    1.500s    374.008s  53.001s 16754.519 118671.950
 32   3   16    1.415s   1051.759s  73.094s  5973.803  85087.358
 32   3   32    1.867s   3400.417s 117.003s  1849.186  53754.928
 32   3   64    5.361s  11633.040s 197.034s   540.577  31881.112
 32   3  128   23.387s  39386.390s 332.055s   159.642  18918.599
 32   3  256   15.409s  20031.450s 168.095s   313.837  37237.918
 32   3  512   18.720s  10338.511s  86.047s   607.446  72752.686

Patched kernel:

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 32   3    1    1.098s    138.544s 139.063s 45053.657  45057.920
 32   3    2    1.022s    127.770s  67.086s 48849.350  92707.085
 32   3    4    0.995s    119.666s  37.045s 52141.503 167955.292
 32   3    8    0.928s     87.400s  18.034s 71227.407 342934.242
 32   3   16    1.067s     72.943s  11.035s 85007.293 553989.377
 32   3   32    1.248s    133.753s  10.038s 46602.680 606062.151
 32   3   64    5.557s    438.634s  13.093s 14163.802 451418.617
 32   3  128   17.860s   1496.797s  19.048s  4153.714 322808.509
 32   3  256   13.382s    766.063s  10.016s  8071.695 618816.838
 32   3  512   17.067s    369.106s   5.041s 16291.764 1161285.521

These numbers are roughly equal to what can be accomplished with the
page fault scalability patches alone.

Kernel patched with both the page fault scalability patches and
prefaulting:

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 32  10    1    4.103s    456.384s 460.046s 45541.992  45544.369
 32  10    2    4.005s    415.119s 221.095s 50036.407  94484.174
 32  10    4    3.855s    371.317s 111.076s 55898.259 187635.724
 32  10    8    3.902s    308.673s  67.094s 67092.476 308634.397
 32  10   16    4.011s    224.213s  37.016s 91889.781 564241.062
 32  10   32    5.483s    209.391s  27.046s 97598.647 763495.417
 32  10   64   19.166s    219.925s  26.030s 87713.212 797286.395
 32  10  128   53.482s    342.342s  27.024s 52981.744 769687.791
 32  10  256   67.334s    180.321s  15.036s 84679.911 1364614.334
 32  10  512   66.516s     93.098s   9.015s131387.893 2291548.865

The fault rate doubles when both patches are applied.

And on the high end (512 processors allocating 256GB). There are no numbers
for the regular kernel because it is extremely slow at this size, and none
for low thread counts, which are also very slow.

With prefaulting:

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
256   3    4    8.241s   1414.348s 449.016s 35380.301 112056.239
256   3    8    8.306s   1300.982s 247.025s 38441.977 203559.271
256   3   16    8.368s   1223.853s 154.089s 40846.272 324940.924
256   3   32    8.536s   1284.041s 110.097s 38938.970 453556.624
256   3   64   13.025s   3587.203s 110.010s 13980.123 457131.492
256   3  128   25.025s  11460.700s 145.071s  4382.104 345404.909
256   3  256   26.150s   6061.649s  75.086s  8267.625 663414.482
256   3  512   20.637s   3037.097s  38.062s 16460.435 1302993.019

Page fault scalability patch and prefaulting. Max prefault order
increased to 5 (max preallocation of 32 pages):

 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
256  10    8   33.571s   4516.293s 863.021s 36874.099 194356.930
256  10   16   33.103s   3737.688s 461.028s 44492.553 363704.484
256  10   32   35.094s   3436.561s 321.080s 48326.262 521352.840
256  10   64   46.675s   2899.997s 245.020s 56936.124 684214.256
256  10  128   85.493s   2890.198s 203.008s 56380.890 826122.524
256  10  256   74.299s   1374.973s  99.088s115762.963 1679630.272
256  10  512   62.760s    706.559s  53.027s218078.311 3149273.714

With both patches we approach almost linear scalability at the high end and
reach a fault rate of more than 3 million faults per second.
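
For the runs above the maximum prefault order was raised to 5 (2^5 = 32
pages). Independent of that cap, the patch below never lets the preallocation
window run past the end of the vma or across a pmd boundary. A user-space
sketch of that bounding (MAX_PREFAULT_ORDER and the 2MB pmd coverage are
illustrative assumptions; the posted patch derives its cap from PAGEVEC_SIZE):

#include <stdio.h>

#define PAGE_SHIFT		12
#define PAGE_SIZE		(1UL << PAGE_SHIFT)
#define PMD_SHIFT		21		/* assumed: 2MB covered per pmd */
#define PMD_MASK		(~((1UL << PMD_SHIFT) - 1))
#define MAX_PREFAULT_ORDER	5		/* 2^5 = 32 pages, as in the runs above */

/* Upper end of the preallocation window for a fault at addr. */
static unsigned long prefault_end(unsigned long addr, int order,
				  unsigned long vm_end)
{
	unsigned long end_addr;

	if (order > MAX_PREFAULT_ORDER)
		order = MAX_PREFAULT_ORDER;
	end_addr = addr + (PAGE_SIZE << order);

	/* Never preallocate past the vma or across a pmd boundary. */
	if (end_addr > vm_end)
		end_addr = vm_end;
	if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
		end_addr &= PMD_MASK;
	return end_addr;
}

int main(void)
{
	unsigned long vm_end = 0x400000;

	printf("mid-pmd fault:      %lu pages\n",
	       (prefault_end(0x100000, 5, vm_end) - 0x100000) / PAGE_SIZE);
	printf("fault near pmd end: %lu pages\n",
	       (prefault_end(0x1ff000, 5, vm_end) - 0x1ff000) / PAGE_SIZE);
	return 0;
}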

The one thing that still takes up a lot of time is the zeroing of pages in
the page fault handler. I am working on another set of patches that prezero
pages, which leads to a further increase in performance by a factor of 2-4
(if prezeroed pages are available, which may not always be the case). Maybe
we can reach 10 million faults/sec that way.

Patch against 2.6.10-rc3-bk3:

Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h	2004-12-01 10:37:31.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h	2004-12-01 10:38:15.000000000 -0800
@@ -537,6 +537,8 @@
 #endif
 
 	struct list_head tasks;
+	unsigned long anon_fault_next_addr;	/* Predicted sequential fault address */
+	int anon_fault_order;			/* Last order of allocation on fault */
 	/*
 	 * ptrace_list/ptrace_children forms the list of my children
 	 * that were stolen by a ptracer.
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c	2004-12-01 10:38:11.000000000 -0800
+++ linux-2.6.9/mm/memory.c	2004-12-01 10:45:01.000000000 -0800
@@ -55,6 +55,7 @@
 
 #include <linux/swapops.h>
 #include <linux/elf.h>
+#include <linux/pagevec.h>
 
 #ifndef CONFIG_DISCONTIGMEM
 /* use the per-pgdat data instead for discontigmem - mbligh */
@@ -1432,8 +1433,106 @@
 		unsigned long addr)
 {
 	pte_t entry;
-	struct page * page = ZERO_PAGE(addr);
+	struct page * page;
+	
+	addr &= PAGE_MASK;
+
+ 	if (current->anon_fault_next_addr == addr) {
+ 		unsigned long end_addr;
+ 		int order = current->anon_fault_order;
+ 
+		/* Sequence of page faults detected. Perform preallocation of pages */
 
+		/* The order of preallocations increases with each successful prediction */
+ 		order++;
+		
+		if ((1 << order) < PAGEVEC_SIZE)
+			end_addr = addr + (1 << (order + PAGE_SHIFT));
+		else
+			end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE;
+
+		if (end_addr > vma->vm_end)
+			end_addr = vma->vm_end;
+		if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
+			end_addr &= PMD_MASK;
+
+		current->anon_fault_next_addr = end_addr;
+	 	current->anon_fault_order = order;
+ 
+		if (write_access) {
+
+			struct pagevec pv;
+			unsigned long a;
+			struct page **p;
+
+			pte_unmap(page_table);
+			spin_unlock(&mm->page_table_lock);
+
+			pagevec_init(&pv, 0);
+
+			if (unlikely(anon_vma_prepare(vma)))
+				return VM_FAULT_OOM;
+
+			/* Allocate the necessary pages */
+			for(a = addr;a < end_addr ; a += PAGE_SIZE) {
+				struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a);
+
+				if (p) {
+					clear_user_highpage(p, a);
+					pagevec_add(&pv,p);
+				} else
+					break;
+			}
+			end_addr = a;
+
+			spin_lock(&mm->page_table_lock);
+
+ 			for(p = pv.pages; addr < end_addr; addr += PAGE_SIZE, p++) {
+
+				page_table = pte_offset_map(pmd, addr);
+				if (!pte_none(*page_table)) {
+					/* Someone else got there first */
+					page_cache_release(*p);
+					pte_unmap(page_table);
+					continue;
+				}
+				
+ 				entry = maybe_mkwrite(pte_mkdirty(mk_pte(*p,
+ 							 vma->vm_page_prot)),
+ 						      vma);
+
+				mm->rss++;
+				lru_cache_add_active(*p);
+				mark_page_accessed(*p);
+				page_add_anon_rmap(*p, vma, addr);
+
+				set_pte(page_table, entry);
+				pte_unmap(page_table);
+
+ 				/* No need to invalidate - it was non-present before */
+ 				update_mmu_cache(vma, addr, entry);
+			}
+ 		} else {
+ 			/* Read */
+ 			for(;addr < end_addr; addr += PAGE_SIZE) {
+				page_table = pte_offset_map(pmd, addr);
+ 				entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+				set_pte(page_table, entry);
+				pte_unmap(page_table);
+
+ 				/* No need to invalidate - it was non-present before */
+				update_mmu_cache(vma, addr, entry);
+	
+			};
+		}
+		spin_unlock(&mm->page_table_lock);
+		return VM_FAULT_MINOR;
+	}
+
+	current->anon_fault_next_addr = addr + PAGE_SIZE;
+	current->anon_fault_order = 0;
+ 
+	page = ZERO_PAGE(addr);
 	/* Read-only mapping of ZERO_PAGE. */
 	entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));