Intel 3.20GHz “Prescott” Pentium IV

..:: Execution Core Improvements ::..

The original execution core of the NetBurst microarchitecture had one great strength when it came to ALU performance: the ALUs within the Pentium 4 ran at twice the core clock frequency. By running the ALUs at double the frequency of the rest of the processor, Intel gave the Pentium 4 high throughput for typical logical and arithmetic operations. With “Prescott,” Intel has added a shifter / rotator block to one of the ALUs. This new block allows the most common shift and rotate instructions to execute on a double-pumped ALU rather than being handled as complex integer operations, as they were on previous Pentium 4s.
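To make the shift and rotate discussion concrete, here is a minimal sketch in C of the kind of operation involved. The function names `rotl32` and `rotr32` are our own, chosen for illustration. Compilers such as GCC and Clang commonly recognize this idiom and emit a single rotate instruction for it, though whether a particular instruction form ends up on the fast ALU of “Prescott” is not something this sketch can show.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bit positions.
   Masking the count keeps both shift amounts in the range 0..31,
   so the expression stays well defined in C even for n == 0 or n >= 32. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Rotate a 32-bit value right by n bit positions (mirror of rotl32). */
uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x >> n) | (x << ((32u - n) & 31u));
}
```

Rotates like these show up constantly in hashing, checksum and cryptographic code, which is why moving them onto a faster execution unit is worthwhile.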


The integer multiply operation has also been improved on “Prescott.” Older Pentium 4s executed these operations on the floating point multiplier. This added a costly latency penalty because the operands had to be moved into the FP unit and, once a result was obtained, the result had to be moved back to the integer side. With “Prescott,” Intel has implemented a dedicated integer multiplier. Integer multiplies no longer need to be shuttled between units, which removes that latency penalty and reduces the number of clock cycles needed to complete them.
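Multiply latency matters most in code where each multiply depends on the result of the previous one, because the processor cannot overlap them. The sketch below uses the well-known 32-bit FNV-1a hash purely as a familiar example of such a dependency chain. It is our own illustration rather than anything from Intel's material, and it makes no claim about actual cycle counts on either core.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a hash.
   Every iteration multiplies the running hash by a constant, and the
   next iteration cannot start its multiply until this one has finished.
   The loop therefore runs at roughly one multiply latency per byte. */
uint32_t fnv1a_32(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint32_t hash = 2166136261u;      /* FNV offset basis */

    for (size_t i = 0; i < len; i++) {
        hash ^= p[i];                 /* mix in the next byte */
        hash *= 16777619u;            /* FNV prime: dependent integer multiply */
    }
    return hash;
}
```

In a loop like this, any cycles shaved off the integer multiply come straight off the time per byte.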

The next part of the NetBurst microarchitecture that has been improved over previous versions of the Pentium 4 is the uop schedulers. In order to keep the execution core constantly fed with new instructions and data, these schedulers must be able to run at high speeds. On “Prescott,” the schedulers are still capable of scheduling six uops per cycle, but several enhancements have been made to the schedulers themselves. The schedulers responsible for x87 and SSE/SSE2/SSE3 instructions were both increased in size to improve performance in multimedia applications. Intel also increased the effective size of the queues that feed the schedulers with new uops, so more uops can be buffered between the allocation and scheduling units before a stall occurs. The advantage is that the allocation units and other logic can now examine more of the instruction stream during times when the schedulers are full. Intel has also improved the way load uops are scheduled, though we won’t be covering this due to its relative complexity.


..:: Memory System Improvements ::..

The last area of the “Prescott” NetBurst microarchitecture that we’ll be covering is the memory system. Several changes have been made here that are geared towards increasing overall performance and reducing the time spent waiting for data to be retrieved from DRAM. The first improvement is a relatively basic one: the L2 cache has been increased to 1MB, double that of the “Northwood” core, which featured only 512KB. With a larger cache, more of the data that would otherwise have to be fetched from DRAM can be held in the L2 cache, where the processor can access it much faster. The L2 cache on “Prescott” is a writeback, 8-way set associative cache with 128-byte lines, much like previous Pentium 4 cache implementations. “Prescott” also addresses several other resource limits: thirty-two stores can now be outstanding in the processor, up from twenty-four in previous versions. The number of write-combining buffers has been increased from six to eight, and the number of outstanding loads that can be serviced by the L1 cache has been doubled from four to eight.
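The cache figures above translate directly into the cache's internal layout. As a rough sketch, the helpers below (our own names, not Intel's) compute the number of sets from capacity, associativity and line size, along with the number of address bits each field uses. The `set_index` function assumes simple modulo indexing on the address, which is an assumption on our part and not a documented detail of either core.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of sets = capacity / (ways * line size). */
size_t cache_sets(size_t capacity_bytes, size_t ways, size_t line_bytes)
{
    return capacity_bytes / (ways * line_bytes);
}

/* log2 of an exact power of two, i.e. the number of address bits
   needed to select one of v lines, sets or bytes. */
unsigned log2_pow2(size_t v)
{
    unsigned bits = 0;
    while (v > 1) {
        v >>= 1;
        bits++;
    }
    return bits;
}

/* Which set an address maps to, ASSUMING plain modulo indexing
   (hypothetical; real hardware may index differently). */
size_t set_index(uint64_t addr, size_t sets, size_t line_bytes)
{
    return (size_t)((addr / line_bytes) % sets);
}
```

For the 1MB “Prescott” L2 this gives 1MB / (8 × 128 bytes) = 1024 sets, so 7 address bits select the byte within a line and 10 bits select the set. A 512KB cache with the same associativity and line size has 512 sets and needs only 9 index bits.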


Another improvement was made to the way the memory system handles software prefetch instructions. Such instructions allow the programmer to request that data be transferred from DRAM to the L2 cache before it is actually used. When a prefetch instruction executes, the processor checks whether the data is already in the L2 cache; on a miss, it starts a prefetch over the FSB to bring the data in from DRAM. Previously, prefetch operations were simply dropped on a TLB miss. On “Prescott,” software prefetches that miss the TLB are allowed to perform page table walks and fill in the missing translation. Special fault-handling logic was added for the case where the walk ends in a page fault: the fault is dealt with quickly and goes unreported to the operating system, and the prefetch is not performed. As a result, a software prefetch on “Prescott” can bring in both the requested data and the page table entries needed to reach it.
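For readers who haven't used software prefetch before, here is a minimal sketch of what it looks like from the programmer's side. It uses the `__builtin_prefetch` builtin available in GCC and Clang, with a no-op fallback for other compilers. The `PREFETCH_READ` macro, the function name and the `distance` parameter are our own illustrative choices, and a sensible prefetch distance has to be found by measurement on the target system.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
/* rw = 0 (read), locality = 3 (keep in all cache levels). */
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
/* Fallback for other compilers: a prefetch is only a hint, so a no-op is safe. */
#define PREFETCH_READ(p) ((void)(p))
#endif

/* Sum an array while asking the processor to start fetching the
   element `distance` positions ahead of the one currently being read.
   `distance` is a hypothetical tuning knob, not a recommended value. */
long sum_with_prefetch(const int *data, size_t n, size_t distance)
{
    long total = 0;

    for (size_t i = 0; i < n; i++) {
        if (distance < n - i)   /* i + distance is still inside the array */
            PREFETCH_READ(&data[i + distance]);
        total += data[i];
    }
    return total;
}
```

The prefetch never changes the result of the loop; it only affects when the data arrives in the cache. On earlier Pentium 4s such a hint would be dropped when the target page was not in the TLB, which is the case the “Prescott” change described above addresses.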


The last improvement to the memory system also deals with prefetching, although this time it concerns the hardware prefetch capabilities of the processor. The hardware prefetch unit on “Prescott” looks for streams of data accesses and tries to predict what data the execution stream will need next. The processor then fetches that data from DRAM and brings it into the L2 cache. Compared with previous versions, the “Prescott” hardware prefetcher is more accurate at detecting when it needs to prefetch data and exactly what data it needs to retrieve.

As we have seen, Intel has made several improvements to the NetBurst microarchitecture with the latest incarnation of the Pentium 4 processor. Nearly all of these improvements address issues found in previous Pentium 4s or enhance the capabilities of Intel’s Hyper-Threading Technology. When word began to spread about the longer pipeline in “Prescott,” many enthusiasts wondered whether the processor's branch prediction scheme would be enhanced to counteract it. As we have just shown, Intel has indeed made several improvements to both the static and dynamic branch prediction algorithms. Several of the enhancements to the memory system and execution core also deal with issues raised during the development of HT Technology. The only questions are how much added performance these improvements bring, and whether they can counteract the deepened pipeline enough to make “Prescott” more attractive to potential buyers than the “Northwood” Pentium 4s.