1_Introduction
Five years after the arrival of its first Phenom processors using the Stars architecture, AMD is preparing to refresh its x86 processor lineup. There was, of course, a revision of the Stars architecture along the way with the arrival of Phenom II, but nothing fundamentally new, and above all nothing able to challenge Intel's near-total dominance, on performance in particular.
This season, and not without some delay, AMD is finally launching its new architecture, codenamed Bulldozer. AMD is also reviving the FX brand for the occasion: the first processors using the Bulldozer architecture are named FX, a name that will bring back many memories for Athlon 64 fans. But whereas the Athlon FX processors of the time were single-core, the new FX models AMD is offering in late 2011 have 8 cores, no less!
Does the return of the FX brand mean a return to performance? How do the new FX processors position themselves against Intel's current lineup? These questions will be answered in the following pages.
2_Bulldozer architecture: make way for modules
2.1_A revisited architecture made of modules
With Phenom and Phenom II, AMD offered us the K10 architecture, codenamed Stars. Relatively efficient in certain areas, it nevertheless struggled to stand comparison with Intel's offering. The gap became even more acute with the appearance of the Sandy Bridge processors, the famous second-generation Core. The arrival of Bulldozer marks AMD's first truly new micro-architecture since Phenom, or even earlier. Some would argue, in fact, that the K10 architecture is a derivative of the Athlon 64's K8... We will not go into these debates, preferring to take you on an overview of the features of this architecture.
Among AMD's objectives in developing Bulldozer were energy efficiency, modularity of the architecture, and the choice to favor operating frequency over the efficiency of the architecture, a rather risky gamble as we shall see.
The big news in the Bulldozer architecture is above all CMT (Cluster Multithreading). This is quite an interesting approach, and rather different from what we usually see, especially at Intel where SMT (Simultaneous Multithreading) reigns supreme. With Bulldozer, AMD challenges, in a sense, the very notion of an x86 execution core. One of the basic concepts of Bulldozer is what AMD calls a module. A Bulldozer module consists of two x86 execution cores. Two true cores? Not quite... The two cores share a number of resources, such as the part responsible for fetching data (that is, loading it), but also the instruction decode unit and the unit in charge of floating-point calculations. The second-level cache is also shared between both cores of the same module, which is clearly the least innovative element here.
The Bulldozer module concept
But why share units between the cores of a single module rather than duplicating them? The reason is simple: to save transistors, of course, and therefore die area, but also to optimize the power consumption of the processor. It is generally accepted that on x86 the unit in charge of decoding is quite power-hungry. Sharing it between two cores helps optimize this energy parameter. More pragmatically, other units are rarely used continuously or at full capacity. This is particularly true of the unit dedicated to floating-point calculations. Pooling it between two cores is therefore quite sensible, since it is then truly put to use. Each Bulldozer module thus has two cores and can run two threads simultaneously.
Note the flexibility this approach brings: if only one thread is running within the module, it has access to all the shared resources. Added to this is a certain ease of modularity for AMD. At launch, the chip maker will offer 6-core and 8-core variants of its FX processors: in the first case the processor consists of three Bulldozer modules, in the second it carries a total of four. Furthermore, this design is supposed to make it easier to raise clock frequencies, in line with the objectives stated above. We recall that AMD has had some difficulties in this area in recent years.
The 8 cores of the FX 8150 as seen by Windows
2.2_Beyond the modules!
Modules are not the only novelty of the Bulldozer architecture, far from it. The front-end is also changing. This block, which keeps the execution units constantly supplied with instructions, is now shared between the cores of the same module. It has therefore been revised accordingly to ensure a sustained throughput. Here AMD's engineers have focused most of their attention on branch handling. Recall that a branch is nothing more than a jump in the code. Put simply, execution moves from one part of the code to another if a condition is met, and the idea is to predict branches in order to speed up the prefetching of the instructions or data needed for the code to continue smoothly.
Thus, the prediction and fetch pipelines are now decoupled, while AMD introduces a number of mechanisms already well known... from Intel. In no particular order, there are the handling of direct and indirect branches with a two-level buffer, loop detection, and the presence of a trace cache to store already decoded micro-instructions, as on Nehalem. As for decoding, the unit dedicated to this function can handle up to four instructions per cycle on Bulldozer, against three previously on Phenom.
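To make the idea of guessing a conditional jump in advance more concrete, here is a minimal sketch of a classic two-bit saturating counter predictor. It is a textbook scheme chosen purely for illustration and says nothing about how Bulldozer's predictor is actually built; the class name `TwoBitPredictor`, the helper `count_correct` and the sample history are all hypothetical.

```python
class TwoBitPredictor:
    """Textbook 2-bit saturating counter (illustrative, not AMD's design)."""

    def __init__(self):
        # States 0-1 predict "not taken", states 2-3 predict "taken".
        self.state = 2  # start in "weakly taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_correct(outcomes):
    """Return how many outcomes of one branch were predicted correctly."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct


# A loop branch: taken nine times, then not taken on exit (made-up history).
loop_history = [True] * 9 + [False]
print(count_correct(loop_history), "of", len(loop_history), "predicted correctly")
```

On this loop-like history only the final exit is mispredicted, which is why regular branches are cheap to predict, while a strictly alternating branch defeats such a simple counter about half the time.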
Focus on front-end
As for the cache, the Bulldozer module has 64 KB of first-level instruction cache, 2-way associative: it is shared between the cores. Each core also has its own 16 KB first-level data cache. This L1D cache is 4-way associative. Performance-wise, this cache is supposed to have been optimized with, among other things, techniques that predict which way of the cache holds the requested data.
Note in passing the arrival of a fusion mechanism. Like Intel (again!), Bulldozer is able to decode several instructions into a single one: AMD calls this branch fusion.
3_ An architecture with heart, heading for DDR3-1866
3.1_ Focus on the cores!
On the side of the x86 execution cores of each Bulldozer module, we find four execution pipes (against 3 previously). They are split into 2 ALUs (arithmetic and logic units) on one side and 2 AGUs (address generation units) on the other. Compared with a Phenom, whose three pipes each included an ALU, a Bulldozer execution core can therefore execute only two integer instructions per clock cycle, against 3 for a Phenom core. This is a theoretical disadvantage that does not take the second execution core into account and does not consider power consumption. Recall that a Bulldozer module with its two execution cores is supposed to be markedly less power-hungry than two Phenom cores.
As for the registers, they benefit from a significant improvement with the arrival of a PRF (physical register file). Here again, Intel's Sandy Bridge architecture recently familiarized us with this notion. The PRF is a kind of index that links the registers used by the out-of-order execution engine to the entries in the buffer. The primary advantage here is the ability to store larger entries.
Our two execution cores share a central unit in charge of 128-bit operations. This unit is made up of two 128-bit pipelines that can be combined into a single 256-bit unit; we will come back to this. Being of the FMAC type, these pipelines can perform multiplication and addition on floating-point numbers in a single pass, with no rounding during the calculation, for maximum precision. Once unified into 256 bits, the two units can process AVX instructions at a rate of one instruction per clock cycle. The move to a PRF mentioned above is of course justified by the arrival of AVX support.

The shared FPU of the Bulldozer cores
Since we are on the subject of instructions, it is worth adding that besides AVX, AMD adds support for the SSE 4.1 and SSE 4.2 instructions. As if this were not enough, AMD also adds its own instructions, poetically named XOP, not to mention FMA4 and CVT16. These instructions complement AVX and stem from the SSE5 project, announced at one time by AMD but never materialized. FMA4 (fused multiply-add) is an instruction that performs a multiplication and an addition in a single operation and stores the result in an additional register. Intel, for its part, uses FMA3, where the result of the operation is placed in one of the registers already used. In this regard, the successor to Bulldozer will support FMA3, joining the Intel camp. These instructions could bring significant gains in high-performance applications (computing, scientific applications, and so on). However, it is a safe bet that compilers will not take advantage of them, especially since AMD has already announced that FMA4 will be abandoned in favor of FMA3 in the future...
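The benefit of a fused multiply-add, and the difference between the four-operand and three-operand forms, can be sketched in a few lines of Python. This is only a model under stated assumptions: the fused operation is emulated with exact rational arithmetic rather than the hardware instruction, and the helpers `fused_multiply_add`, `fma4` and `fma3` as well as the register names are hypothetical and much simpler than the real encodings.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate a fused multiply-add: exact a*b + c, rounded once at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single rounding to double precision


def separate_multiply_add(a, b, c):
    """Multiply then add as two operations, each rounding its result."""
    return a * b + c


def fma4(regs, dst, src1, src2, src3):
    # Four-operand form: the destination is distinct from the three sources.
    regs[dst] = fused_multiply_add(regs[src1], regs[src2], regs[src3])


def fma3(regs, dst_src1, src2, src3):
    # Three-operand form (simplified): the result overwrites a source register.
    regs[dst_src1] = fused_multiply_add(regs[dst_src1], regs[src2], regs[src3])


# Inputs chosen so that a*b equals 1 - 2**-60, which does not fit in a double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(separate_multiply_add(a, b, c))  # 0.0: the small term was rounded away
print(fused_multiply_add(a, b, c) == -(2.0 ** -60))  # True: the term survives

regs = {"r0": 2.0, "r1": 3.0, "r2": 4.0, "r3": 0.0}
fma4(regs, "r3", "r0", "r1", "r2")  # r3 = r0*r1 + r2, sources untouched
fma3(regs, "r0", "r1", "r2")        # r0 = r0*r1 + r2, r0 is overwritten
print(regs)
```

The four-operand form keeps all three inputs intact at the cost of a longer encoding, while the three-operand form saves an operand but destroys one input.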

Instructions supported by Bulldozer
Note finally the support for AES, which provides hardware acceleration of encryption and decryption operations using the standard of the same name.
What about memory?
We mentioned the arrangement of the first-level caches a little earlier. Of course, the Bulldozer architecture does not stop at this single cache level. AMD provides a set-associative second-level cache. Organized in 16 ways, it is shared between the cores of a module and its size rises to 2 MB per module.
A third level of cache is also provided, with a maximum of 8 MB shared between all the modules making up the processor. Still set-associative, the L3 cache has 64 ways. The third-level cache is non-inclusive and holds the data evicted from the L2 cache. In total, therefore, an FX processor based on the Bulldozer architecture can have up to 16 MB of cache! That is no small amount. As for cache latency, we measured 3 cycles for the L1, 18 for the L2, which is very good, and 65 for the L3. As feared, this last value is quite high, all the more so since the L3 cache latency of a second-generation Core is 57 cycles.
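The cache sizes and associativity figures quoted in the text can be turned into a quick back-of-the-envelope calculation. The sketch below derives the number of sets of each cache and checks the 16 MB total; the 64-byte line size is an assumption that the article does not state, and the helper `num_sets` is hypothetical.

```python
KB = 1024
MB = 1024 * KB
LINE_SIZE = 64  # bytes per cache line (assumption, not stated in the article)


def num_sets(size_bytes, ways, line_size=LINE_SIZE):
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    return size_bytes // (ways * line_size)


# (size in bytes, associativity) as given in the article.
caches = {
    "L1I (per module)": (64 * KB, 2),
    "L1D (per core)": (16 * KB, 4),
    "L2 (per module)": (2 * MB, 16),
    "L3 (shared)": (8 * MB, 64),
}

for name, (size, ways) in caches.items():
    print(name, "->", num_sets(size, ways), "sets")

# Total L2 + L3 on a four-module (8-core) FX processor.
modules = 4
total_mb = (modules * 2 * MB + 8 * MB) // MB
print("L2 + L3 total:", total_mb, "MB")
```

The 16 MB total counts only the L2 and L3 caches; the first-level caches are left out of that figure.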
Beyond the caches, the Bulldozer architecture of course integrates the memory controller. It is a dual-channel DDR3 memory controller, each channel being 64 bits wide with 8 additional bits for error correction. The controller supports relatively high frequencies. With two memory modules installed, DDR3-1866 can be used, which in practice runs at 933 MHz. Note that with four memory modules the maximum necessarily falls to 800 MHz, that is DDR3-1600. AMD also points to support for a supply voltage of 1.25 volts for DDR3.
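As a rough guide to what these memory speeds mean, the theoretical peak bandwidth can be worked out from the transfer rate, the 64-bit data width of each channel and the number of channels. The helpers `io_clock_mhz` and `peak_bandwidth_gb_s` below are hypothetical, the nominal rates 1866 and 1600 are taken at face value, and real-world throughput will be lower than these peaks.

```python
def io_clock_mhz(transfer_rate_mt_s):
    """DDR transfers data twice per clock, so the clock is half the rate."""
    return transfer_rate_mt_s / 2


def peak_bandwidth_gb_s(transfer_rate_mt_s, channels=2, data_bits=64):
    """Theoretical peak bandwidth in GB/s (10**9 bytes per second).

    Only the 64 data bits count; the 8 error-correction bits carry no payload.
    """
    bytes_per_transfer = data_bits // 8
    return transfer_rate_mt_s * 1e6 * bytes_per_transfer * channels / 1e9


for rate in (1866, 1600):
    print("DDR3-%d: %.0f MHz clock, %.1f GB/s peak over two channels"
          % (rate, io_clock_mhz(rate), peak_bandwidth_gb_s(rate)))
```

Under these assumptions, dropping from DDR3-1866 to DDR3-1600 when four memory modules are installed costs roughly 14% of the theoretical peak bandwidth.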
DDR3-1866 and FX 8150