..:: “Prescott” Microarchitecture Improvements ::..
As we have already stated, “Prescott” is more of an evolutionary development in the Pentium 4 line than the revolutionary chip many originally thought it to be. As the clock frequencies of the Pentium 4 chips keep growing, they run into an ever-increasing latency penalty whenever needed data is not ready in the cache and must instead be fetched from memory. This problem becomes more apparent as the frequency of the processor increases, and thus several of the improvements that Intel has made to the NetBurst microarchitecture aim to increase how often the needed data is already stored within the on-die cache. The other main reason for the bulk of the improvements made to the NetBurst microarchitecture for “Prescott” lies with Hyper-Threading. Intel has been pushing “Prescott” as featuring improved Hyper-Threading Technology, and several enhancements have been made to the NetBurst microarchitecture to allow for this. The changes we will now be discussing fall into four areas: the Front End, the Execution Core, the Memory System, and Hyper-Threading specific enhancements.
..:: Front End Improvements ::..
I bet I can guess the first two words that popped into your head once you read about the pipeline length increase for “Prescott.” The first two words that came to my mind when this was confirmed were “Branch Prediction,” and I’m sure I’m not the only one out there who was waiting to see some nice improvements in the Pentium 4’s branch prediction to allow for the deeper pipeline. The Front End of the NetBurst microarchitecture utilized by “Prescott” is very much one and the same with the previous version featured on the “Northwood” processors. The instruction fetch and decode segments of the microarchitecture have seen some performance enhancements, but remain largely intact from previous incarnations of the Pentium 4 processor.
For those of you who aren’t “in the know” about what exactly branch prediction is, here is a quick explanation that should help you understand the topic a little better. Modern superscalar processors fetch instructions ahead of execution, so when they reach a conditional branch, such as an if-then-else statement, they must guess whether the branch will be taken or not taken. When such a branch is executed, the address of the branch and the address of its target are stored in what is known as the Branch Target Buffer, or BTB. This data is then used to determine which direction the branch is likely to go the next time it is executed. When the branch is correctly predicted, there are no problems within the pipeline and execution continues without interruption. If the prediction is wrong, however, the pipeline must be flushed: the instructions that were fetched down the wrong path are discarded and the processor restarts from the correct path, causing a costly penalty in clock cycles.
Intel’s branch prediction scheme works very much in line with the above explanation, although it features several enhancements, such as both static and dynamic prediction algorithms. The portion of the processor that is responsible for the bulk of the prediction is the Branch Target Buffer. When the instruction decoder deals with a given instruction, it determines whether or not the instruction is a branch, and whether or not the BTB has a prediction for it. If the instruction decoder determines that the Branch Target Buffer has not made a prediction, then the processor makes a static prediction instead.
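The lookup order described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class `BranchTargetBuffer`, the function `predict`, and the "repeat the last outcome" rule for the dynamic prediction are all hypothetical simplifications, not Intel's actual design, which is far more elaborate and is not detailed here.

```python
from typing import Callable, Dict, Optional, Tuple


class BranchTargetBuffer:
    """Toy BTB keyed by branch address (hypothetical, unbounded in size)."""

    def __init__(self) -> None:
        # branch address -> (was the branch taken last time?, target address)
        self._entries: Dict[int, Tuple[bool, int]] = {}

    def lookup(self, branch_addr: int) -> Optional[bool]:
        """Return the stored direction, or None when the BTB has no entry."""
        entry = self._entries.get(branch_addr)
        return None if entry is None else entry[0]

    def update(self, branch_addr: int, taken: bool, target: int) -> None:
        """Record the resolved outcome once the branch has executed."""
        self._entries[branch_addr] = (taken, target)


def predict(
    btb: BranchTargetBuffer,
    branch_addr: int,
    static_fallback: Callable[[], bool],
) -> Tuple[bool, str]:
    """Use the BTB when it has a prediction, else fall back to the static rule."""
    dynamic = btb.lookup(branch_addr)
    if dynamic is not None:
        return dynamic, "dynamic"
    return static_fallback(), "static"


btb = BranchTargetBuffer()
# First encounter: the BTB has never seen this branch, so the static rule decides.
first = predict(btb, 0x4000, lambda: False)
# After the branch resolves as taken, the BTB supplies the prediction.
btb.update(0x4000, taken=True, target=0x3F80)
second = predict(btb, 0x4000, lambda: False)
```

The point of the sketch is only the ordering: the static algorithm is a fallback that matters the first time a branch is seen, or after its entry has been evicted from the BTB.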
In previous versions of the NetBurst microarchitecture, the static branch prediction algorithm made its prediction for a conditional branch based on the branch direction. If the direction of the branch in question was backwards, the processor would predict the branch to be taken. If the branch jumped forward, the static branch prediction algorithm would predict that the branch would not be taken. This was helpful in predicting backwards loop branches, but since not all backwards branches are loops, the algorithm was not as effective as it could have been for non-loop backwards branches. Intel conducted several studies to determine how to modify the algorithm to address this issue. Intel found that there is a threshold for the distance between a backwards branch and its target: if that distance is larger than the threshold, the branch is unlikely to be a loop-ending branch. Intel has therefore modified the algorithm so that, if the Branch Target Buffer fails to supply a prediction for a backwards branch, the branch is predicted taken only if the branch distance is smaller than the threshold. This helps address the issue of backwards branches that are not loop-ending branches.
Further studies conducted by Intel showed that conditional branches with certain conditions were, more often than not, not taken, regardless of whether the branch was forward or backwards and regardless of the branch distance. When the instruction decoder detects such a branch and the BTB has not made a prediction for it, the modified static branch prediction algorithm assigns a prediction of not taken.
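To make the difference between the old and new static rules concrete, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration: the threshold value, the contents of the "usually not taken" condition set, and the names `Branch`, `static_predict_old`, and `static_predict_new` are hypothetical, since neither the actual threshold nor the actual condition types are given here.

```python
from dataclasses import dataclass

# Hypothetical values: the real threshold and condition types are not stated.
BACKWARD_DISTANCE_THRESHOLD = 1024  # bytes between the branch and its target
USUALLY_NOT_TAKEN_CONDITIONS = frozenset({"cond_x"})  # placeholder name


@dataclass(frozen=True)
class Branch:
    address: int    # address of the branch instruction
    target: int     # address the branch jumps to when taken
    condition: str  # label for the kind of condition being tested


def static_predict_old(branch: Branch) -> bool:
    """Earlier rule: backwards means taken, forwards means not taken."""
    return branch.target < branch.address


def static_predict_new(branch: Branch) -> bool:
    """Modified rule, used only when the BTB supplies no prediction."""
    # Certain condition types are predicted not taken regardless of
    # direction or distance.
    if branch.condition in USUALLY_NOT_TAKEN_CONDITIONS:
        return False
    # Forward branches are still predicted not taken.
    if branch.target >= branch.address:
        return False
    # Backwards branches are predicted taken only when the jump is short
    # enough to plausibly be the bottom of a loop.
    distance = branch.address - branch.target
    return distance < BACKWARD_DISTANCE_THRESHOLD


loop_end = Branch(address=0x1040, target=0x1000, condition="other")
far_back = Branch(address=0x9000, target=0x1000, condition="other")
forward = Branch(address=0x1000, target=0x1040, condition="other")
special = Branch(address=0x1040, target=0x1000, condition="cond_x")
```

With these made-up inputs, `loop_end` is predicted taken by both rules, while `far_back` and `special` are predicted taken by the old rule and not taken by the new one. Those are exactly the two cases the modifications target.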
The static prediction algorithm was not the only aspect of the Front End to receive improvements. Intel also addressed some issues that it found with the dynamic branch prediction algorithms. One of the major enhancements made to the dynamic prediction algorithm was the addition of an indirect branch predictor, a feature that first appeared in the Pentium M processor. Based on the results it saw with the Pentium M, Intel chose to add an indirect branch prediction unit to the NetBurst architecture as well.
According to test data recorded by Intel, these various branch prediction enhancements, especially the addition of an indirect branch predictor, have reduced the number of mispredicted branches by a sizeable 10% to 20%, and in some cases by as much as 40% or more. These improvements will help offset the additional pipeline stages that Intel has added to the “Prescott” microarchitecture, given that with fewer mispredictions, the pipeline won’t need to be flushed as often as it would with previous implementations.
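A rough back-of-the-envelope model shows how fewer mispredictions trade off against a deeper pipeline. The sketch below is illustrative only: it assumes the flush penalty scales with pipeline depth, and the function names and the example depths of 20 and 31 stages are inputs chosen for illustration, not measurements.

```python
def mispredict_cycles(branches: float, mispredict_rate: float,
                      flush_penalty: float) -> float:
    """Cycles lost to flushes: branches x misprediction rate x penalty."""
    return branches * mispredict_rate * flush_penalty


def break_even_reduction(old_penalty: float, new_penalty: float) -> float:
    """Fraction of mispredictions that must be avoided so that a longer
    flush penalty costs no more total cycles than the shorter one did.

    Derived from new_rate * new_penalty = old_rate * old_penalty.
    """
    return 1.0 - old_penalty / new_penalty


# Illustrative inputs (assumptions): penalty taken as equal to pipeline depth.
OLD_PENALTY = 20.0
NEW_PENALTY = 31.0

needed = break_even_reduction(OLD_PENALTY, NEW_PENALTY)

# Compare lost cycles for 200 branches at a 5% misprediction rate, then with
# a 20% reduction in mispredictions on the deeper pipeline.
before = mispredict_cycles(200, 0.05, OLD_PENALTY)
after = mispredict_cycles(200, 0.05 * (1 - 0.20), NEW_PENALTY)
```

Under these assumed inputs, `needed` works out to about 0.35, so a 10% to 20% reduction in mispredictions offsets only part of the extra flush cost, which is consistent with describing the enhancements as helping to offset the deeper pipeline rather than cancelling it outright.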
The last improvement that Intel has made to the Front End of the NetBurst microarchitecture for “Prescott” is that more types of uops can be encoded within the trace cache of the processor. In previous versions of the NetBurst microarchitecture, if an instruction needed a uop or set of uops that could not be encoded within the trace cache, the processor had to retrieve all of the necessary uops for the instruction from the onboard Microcode ROM. With more uop types encodable in the trace cache, the front end can deliver higher uop bandwidth to the execution core and cuts down on the number of times it must access the Microcode ROM to acquire a given uop. Note that this gain comes from the broader encoding rather than from a larger trace cache: the trace cache capacity is unchanged at roughly 12K uops, and the doubling from 8KB to 16KB applies to the L1 data cache. The new types of uops that can be encoded within the trace cache include software prefetch instructions and some instructions that use register source operands, such as indirect calls through a register.