Intel 3.20GHz “Prescott” Pentium IV

..:: SSE3 Instruction Set – P.N.I. ::..

Another improvement that arrives with the “Prescott” processor is the set of thirteen new instructions that make up SSE3. As we all know, it will take some time for new software to be written, or existing software to be re-coded, to use these instructions, so for now they are really out of the performance picture. As time progresses, we will see software that has been optimized for SSE3, and only then will we be able to realize the performance enhancements it brings. The bulk of the SSE3 instructions allow for more efficient coding schemes, which should increase the overall performance of software that uses them. Given the complexity of some of these new instructions, I’m only going to scratch the surface to give you a rough idea of the optimizations each one allows.


The thirteen new instructions included in SSE3 allow for optimizations of x87 to integer conversions, complex arithmetic, video encoding, graphics, and Hyper-Threading. With the exception of three of these instructions, those being the x87 to integer conversion and the two HT instructions, all of these new enhancements utilize the SSE registers. The first new instruction we will discuss is fisttp, for x87 to integer conversions. When a floating-point number is converted to an integer, the result must follow the rules of C/C++ and other languages, which require truncation. To get that, the x87 rounding mode must be set to “chop”. The older fistp instruction rounds according to the mode held in the FP Control Word, an x87 control register, and the default mode there is round to nearest “even”, which does not give the truncated result those languages require. Code therefore had to save the control word, switch it to “chop”, convert, and then restore it. The new instruction always uses “chop” as the rounding mode regardless of the FP Control Word, so the conversion can be cut down from several instructions to a single one.
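To make the rounding difference concrete, here is a minimal sketch in plain C. It is only a scalar model of the two behaviors, not the instructions themselves, and the function names are made up for illustration. It assumes the input fits in a long.

```c
/* Scalar model only: plain C that mimics the two rounding behaviors.
   The function names are hypothetical, chosen for this sketch. */

/* What fistp gives under the default FP Control Word:
   round to nearest, with ties going to the even integer. */
long convert_like_fistp_default(double x)
{
    long t = (long)x;             /* C cast: truncates toward zero */
    double frac = x - (double)t;  /* leftover fraction, same sign as x */

    if (frac > 0.5 || (frac == 0.5 && (t & 1)))
        return t + 1;
    if (frac < -0.5 || (frac == -0.5 && (t & 1)))
        return t - 1;
    return t;
}

/* What fisttp always gives: "chop", i.e. truncation toward zero.
   This is also what a C cast from double to long must produce. */
long convert_like_fisttp(double x)
{
    return (long)x;
}
```

With an input of 2.7 the first function returns 3 while the second returns 2, and with -2.7 they return -3 and -2. The second result is the one a C cast must produce, which is why compilers previously had to wrap fistp in control word changes.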


Next up, we have the five instructions that are meant for complex arithmetic. Complex arithmetic is everywhere in computing, especially in multimedia applications in the form of digital filters such as IIR and FIR filters. There are, of course, numerous other areas where complex arithmetic is used, but covering them all here would be impossible given how widespread it is. Two of the five instructions, addsubps and addsubpd, perform a floating-point subtraction on some elements of a packed operand and an addition on the others. This removes the need to change the sign of some operands beforehand, as was previously required. The remaining three instructions, movsldup, movshdup, and movddup, combine a load with some form of duplication. Code that uses these new instructions is slightly smaller than before, but the real advantage comes from the decreased processing time required to complete the operations.
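As a rough illustration of how these instructions fit together, the sketch below models a multiply of two pairs of complex numbers in plain C. Each helper mimics what one instruction does to a register of four floats. The vec4 type, the helper names, and the swap_pairs step (a stand-in for an ordinary SSE shuffle) are assumptions made for this example, and real code would use the compiler's SSE3 intrinsics instead.

```c
/* Scalar model of four packed floats: {re0, im0, re1, im1}. */
typedef struct { float v[4]; } vec4;

/* movsldup: duplicate the even (low) elements -> {a0, a0, a2, a2}. */
vec4 model_movsldup(vec4 a)
{
    vec4 r = {{ a.v[0], a.v[0], a.v[2], a.v[2] }};
    return r;
}

/* movshdup: duplicate the odd (high) elements -> {a1, a1, a3, a3}. */
vec4 model_movshdup(vec4 a)
{
    vec4 r = {{ a.v[1], a.v[1], a.v[3], a.v[3] }};
    return r;
}

/* Stand-in for a plain SSE shuffle that swaps re and im in each pair. */
vec4 swap_pairs(vec4 a)
{
    vec4 r = {{ a.v[1], a.v[0], a.v[3], a.v[2] }};
    return r;
}

/* Element-wise multiply, as mulps does. */
vec4 model_mulps(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.v[i] = a.v[i] * b.v[i];
    return r;
}

/* addsubps: subtract in even elements, add in odd elements. */
vec4 model_addsubps(vec4 a, vec4 b)
{
    vec4 r = {{ a.v[0] - b.v[0], a.v[1] + b.v[1],
                a.v[2] - b.v[2], a.v[3] + b.v[3] }};
    return r;
}

/* Two complex products at once:
   re = a.re*b.re - a.im*b.im,  im = a.re*b.im + a.im*b.re */
vec4 complex_mul2(vec4 a, vec4 b)
{
    vec4 re_part = model_mulps(model_movsldup(a), b);
    vec4 im_part = model_mulps(model_movshdup(a), swap_pairs(b));
    return model_addsubps(re_part, im_part);
}
```

The point to notice is the last step: without addsubps, one of the two products would need its sign flipped on alternating elements before a normal add could be used.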


The lddqu SSE3 instruction addresses a cache line split problem that is often seen in video encoding applications. The most computationally intensive portion of a video encoder is usually Motion Estimation. Motion Estimation exposes a limitation of the NetBurst microarchitecture in that it does not support a uop to load 128 bits of unaligned data. To handle such data, the load is split into two separate 64-bit loads, potentially causing a cache line split. The new instruction takes on this problem by loading a 32-byte block aligned on a 16-byte boundary and extracting the 16 bytes of unaligned data from it. This means the processor reads more data than was actually requested, but it avoids the penalty of a cache line split and improves encoding performance.
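The idea of "read a wider aligned block, then pick out the bytes you wanted" can be sketched in plain C as follows. This is only a model of the concept described above, not how the hardware is built. The function name is hypothetical, and it assumes that base points to 16-byte aligned memory with at least 32 readable bytes from the aligned position.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of an lddqu-style load: fetch 16 bytes starting at base + offset
   by reading one 32-byte block that starts on a 16-byte boundary.
   Assumption: base is 16-byte aligned and the buffer is padded so that
   the whole 32-byte block is readable. */
void load16_via_aligned_block(const uint8_t *base, size_t offset,
                              uint8_t out[16])
{
    size_t aligned = offset & ~(size_t)15;  /* round down to 16-byte boundary */
    size_t shift   = offset - aligned;      /* 0..15 bytes into the block */
    uint8_t block[32];

    memcpy(block, base + aligned, 32);      /* the wide, aligned read */
    memcpy(out, block + shift, 16);         /* extract the requested bytes */
}
```

Note that the wide read touches up to 16 bytes beyond the requested data, which is the extra work mentioned above.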


The next four instructions, haddpd, hsubpd, haddps, and hsubps, all deal with the way vertex databases are stored for graphics. Most vertex databases are stored as what is known as an array of structures, or AOS. The original SSE instructions did not handle AOS layouts well, as that instruction set was designed around a structure of arrays, or SOA, layout. Because AOS is the more common layout for vertex databases, SSE code had to load the data and then shuffle it into the proper order for computation. With these new instructions, Intel has improved a task that many of us are familiar with, that being vertex shading. The most common computation in vertex shading is the scalar (dot) product, which requires adding together elements within a single register. Because SSE copes poorly with the AOS format and lacks proper horizontal instructions, Intel has added these four horizontal add and subtract instructions to aid such computations and speed up the overall process.
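Here is a small plain C model of how a horizontal add helps with a dot product on AOS data. The float4 type and function names are assumptions for this sketch. In real code the same steps would be one packed multiply followed by two haddps instructions.

```c
/* One AOS vertex component set held in a single register: {x, y, z, w}. */
typedef struct { float v[4]; } float4;

/* haddps: add adjacent pairs within each source.
   Result = {a0+a1, a2+a3, b0+b1, b2+b3}. */
float4 model_haddps(float4 a, float4 b)
{
    float4 r = {{ a.v[0] + a.v[1], a.v[2] + a.v[3],
                  b.v[0] + b.v[1], b.v[2] + b.v[3] }};
    return r;
}

/* hsubps: subtract adjacent pairs within each source.
   Result = {a0-a1, a2-a3, b0-b1, b2-b3}. */
float4 model_hsubps(float4 a, float4 b)
{
    float4 r = {{ a.v[0] - a.v[1], a.v[2] - a.v[3],
                  b.v[0] - b.v[1], b.v[2] - b.v[3] }};
    return r;
}

/* Dot product of two 4-element vectors stored AOS style:
   multiply element-wise, then two horizontal adds sum all four products. */
float dot4(float4 a, float4 b)
{
    float4 p;
    for (int i = 0; i < 4; i++)
        p.v[i] = a.v[i] * b.v[i];

    float4 h1 = model_haddps(p, p);    /* {p0+p1, p2+p3, p0+p1, p2+p3} */
    float4 h2 = model_haddps(h1, h1);  /* every element = p0+p1+p2+p3 */
    return h2.v[0];
}
```

Without a horizontal add, the four products would have to be shuffled into separate registers before they could be summed, which is the extra work the AOS layout used to force.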


The last of the new SSE3 instructions deal with thread synchronization. These two instructions, monitor and mwait, are meant to put Intel’s Hyper-Threading Technology to further use and increase its performance. The monitor instruction sets up the hardware to detect stores into an address range for a given thread. The mwait instruction can then be used by software to hint to the processor that the thread is not doing any useful work at that time, and it puts the processor into a special low-power state while the monitored address range is watched. Once there is a store within this address range, or perhaps an interrupt or fault, the processor leaves the low-power state and goes back to normal operation.

Each of these newly added SSE3 instructions adds to the multimedia capabilities of the Pentium 4 processor. When the Pentium 4 originally debuted a few years back, Intel was touting it as a multimedia powerhouse, and from the added instruction set that we are seeing here, it seems Intel is still pushing to keep a full head of steam in the multimedia area of their desktop processors. The new thread synchronization instructions will further help Intel push forward with Hyper-Threading Technology by letting software tell the processor when a thread isn’t taking part in active work, so the processor can either go into a low-power mode or work on a separate thread. The only problem with adding a new instruction set is that it will take some time for software applications to take full advantage of it, so right off the bat we won’t see many performance enhancements. Much like when the original Pentium 4 debuted, it will take time for software to be coded to take advantage of the modified NetBurst microarchitecture and these thirteen new SSE3 instructions.