Resistive Computation: Avoiding the Power Wall with Low-Leakage, STT-MRAM Based Computing Xiaochen Guo Engin ˙Ipek Tolga Soyata University of Rochester Rochester, NY 14627 USA {xiguo, ipek, soyata}@ece.rochester.edu ABSTRACT 1. INTRODUCTION As CMOS scales beyond the 45nm technology node, leakage Over the past two decades, the CMOS microprocessor de- concerns are starting to limit microprocessor performance sign process has been confronted by a number of seemingly growth. To keep dynamic power constant across process gen- insurmountable technological challenges (e.g., the memory erations, traditional MOSFET scaling theory prescribes re- wall [4] and the wire delay problem [1]). At each turn, ducing supply and threshold voltages in proportion to device new classes of systems have been architected to meet these dimensions, a practice that induces an exponential increase challenges, and microprocessor performance has continued in subthreshold leakage. As a result, leakage power has be- to scale with exponentially increasing transistor budgets. come comparable to dynamic power in current-generation With more than two billion transistors integrated on a sin- processes, and will soon exceed it in magnitude if voltages gle die [27], power dissipation has become the current crit- are scaled down any further. Beyond this inflection point, ical challenge facing modern chip design. On-chip power multicore processors will not be able to afford keeping more dissipation now exhausts the maximum capability of con- than a small fraction of all cores active at any given moment. ventional cooling technologies; any further increases will re- Multicore scaling will soon hit a power wall. quire expensive and challenging solutions (e.g., liquid cool- This paper presents resistive computation, a new tech- ing), which would significantly increase overall system cost. nique that aims at avoiding the power wall by migrating Multicore architectures emerged in the early 2000s as a most of the functionality of a modern microprocessor from means of avoiding the power wall, increasing parallelism un- CMOS to spin-torque transfer magnetoresistive RAM (STT- der a constant clock frequency to avoid an increase in dy- MRAM)—a CMOS-compatible, leakage-resistant, non-volatile namic power consumption. Although multicore systems did resistive memory technology. By implementing much of manage to keep power dissipation at bay for the past decade, the on-chip storage and combinational logic using leakage- with the impending transition to 32nm CMOS, they are resistant, scalable RAM blocks and lookup tables, and by starting to experience scalability problems of their own. To carefully re-architecting the pipeline, an STT-MRAM based maintain constant dynamic power at a given clock rate, sup- implementation of an eight-core Sun Niagara-like CMT pro- ply and threshold voltages must scale with feature size, but cessor reduces chip-wide power dissipation by 1.7× and leak- this approach induces an exponential rise in leakage power, age power by 2.1× at the 32nm technology node, while main- which is fast approaching dynamic power in magnitude. Un- taining 93% of the system throughput of a CMOS-based de- der this poor scaling behavior, the number of active cores sign. on a chip will have to grow much more slowly than the total transistor budget allows; indeed, at 11nm, over 80% of all cores may have to be dormant at all times to fit within the chip’s thermal envelope [16]. This paper presents resistive computation, an architec- Categories and Subject Descriptors tural technique that aims at developing a new class of power- efficient, scalable microprocessors based on emerging resis- B.3.1 [Memory Structures]: Semiconductor Memories; tive memory technologies. Power- and performance-critical C.1.4 [Processor Architectures]: Parallel Architectures hardware resources such as caches, memory controllers, and floating-point units are implemented using spin-torque trans- fer magnetoresistive RAM (STT-MRAM)—a CMOS- com- General Terms patible, near-zero static-power, persistent memory that has been in development since the early 2000s [12], and is ex- Design, Performance pected to replace commercially available magnetic RAMs by 2013 [13]. The key idea is to implement most of the on- chip storage and combinational logic using scalable, leakage- Keywords resistant RAM arrays and lookup tables (LUTs) constructed from STT-MRAM to lower leakage, thereby allowing many Power-efficiency, STT-MRAM more active cores under a fixed power budget than a pure CMOS implementation could afford. By adopting hardware structures amenable to fast and ef- ficient LUT-based computing, and by carefully re-architecting the pipeline, an STT-MRAM based implementation of an Permission to make digital or hard copies of all or part of this work for eight-core, Sun Niagara-like CMT processor reduces leakage personal or classroom use is granted without fee provided that copies are and total power at 32nm by 2.1× and 1.7×, respectively, not made or distributed for profit or commercial advantage and that copies while maintaining 93% of the system throughput of a pure bear this notice and the full citation on the first page. To copy otherwise, to CMOS implementation. republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISCA’10, June 19–23, 2010, Saint-Malo, France. Copyright 2010 ACM 978-1-4503-0053-7/10/06 ...$10.00. 371 2. BACKGROUND AND MOTIVATION trons with polarity anti-parallel to the two layers can travel Simultaneously to power-related problems in CMOS, DRAM through the MTJ easily, while electrons with the same spin is facing severe scalability problems due to precise charge as the two layers are scattered; in contrast, when the two placement and sensing hurdles in deep-submicron processes. layers have anti-parallel polarities, electrons of either polar- In response, the industry is turning its attention to resistive ity are largely scattered by one of the two layers, leading memory technologies such as phase-change memory (PCM), to much lower conductivity, and thus, higher resistance [6]. memristors (RRAM), and spin-torque transfer magnetore- These low and high resistances are used to represent differ- sistive RAM (STT-MRAM)—memory technologies that rely ent logic values. on resistivity rather than charge as the information carrier, The most commonly used structure for an STT-MRAM and thus hold the potential to scale to much smaller geome- memory cell is the 1T-1MTJ cell that comprises a single tries than charge memories [13]. Unlike the case of SRAM MTJ, and a single transistor that acts as an access device or DRAM, resistive memories rely on non-volatile, resistive (Figure 2). Transistors are built in CMOS, and the MTJ information storage in a cell, and thus exhibit near-zero leak- magnetic material is grown over the source and drain re- age in the data array. gions of the transistors through a few (typically two or three) additional process steps. Similarly to SRAM and DRAM, 1T-1MTJ cells can be coupled through wordlines and bit- 2.1 STT-MRAM lines to form memory arrays. Each cell is read by driving STT-MRAM [13,20,31–33] is a second generation MRAM the appropriate wordline to connect the relevant MTJ to technology that addresses many of the scaling problems of its bitline (BL) and source line (SL), applying a small bias commercially available toggle-mode magnetic RAMs. Among voltage (typically 0.1V ) across the two, and by sensing the all resistive memories, STT-MRAM is the closest to being current passing through the MTJ using a current sense am- a CMOS-compatible universal memory technology as it of- plifier connected to the bitline. Read speed is determined fers read speeds as fast as SRAM [39] (< 200ps in 90nm), by how fast the capacitive wordline can be charged to turn density comparable to DRAM (10F 2 ), scalable energy char- on the access transistor, and by how fast the bitline can be acteristics [13], and infinite write endurance. Functional raised to the required read voltage to sample the read-out array prototypes [14, 20, 31], CAM circuits [37], and simu- current. The write operation, on the other hand, requires lated FPGA chips [39] using STT-MRAM have already been activating the access transistor, and applying a much higher demonstrated, and the technology is under rapid commercial voltage (typically Vdd) that can generate enough current to development, with an expected industry-wide switch from modify the spin of the free layer. toggle-mode MRAM to STT-MRAM by 2013 [13]. Although MRAM suffers from relatively high write power and write SL BL latency compared to SRAM, its near-zero leakage power dis- sipation, coupled with its fast read speed and scalability WL makes it a promising candidate to take over as the workhorse for on-chip storage in sub-45nm processes. Memory Cells and Array Architecture. STT-MRAM Figure 2: Illustrative example of a 1T-1MTJ cell. relies on magnetoresistance to encode information. Figure 1 An MTJ can be written in a thermal activation mode depicts the fundamental building block of an MRAM cell, through the application of a long, low-amplitude current the magnetic tunnel junction (MTJ). An MTJ consists of pulse (>10ns), under a dynamic reversal regime with inter- two ferromagnetic layers and a tunnel barrier layer, often mediate current pulses (3-10ns), or in a precessional switch- implemented using a magnetic thin-film stack comprising Co40 F e40 B20 for the ferromagnetic layers, and M gO for the ing regime with a short (<3ns), high-amplitude current pulse [12]. tunnel barrier. One of the ferromagnetic layers, the pinned In a 1T-1MTJ cell with a fixed-size MTJ, a tradeoff exists layer, has a fixed magnetic spin, whereas the spin of the between switching time (i.e., current pulse width) and cell electrons in the free layer can be influenced by first applying area. In precessional mode, the required current density a high-amplitude current pulse through the pinned layer to Jc (τ ) to switch the state of the MTJ is inversely propor- polarize the current, and then passing this spin-polarized tional to switching time τ current through the free layer. Depending on the direction C of the current, the spin polarity of the free layer can be made Jc (τ ) ∝ Jc0 + τ either parallel or anti-parallel to that of the pinned layer. where Jc0 is a process-dependent intrinsic current density parameter, and C is a constant that depends on the angle of Pinned Pinned the magnetization vector of the free layer [12]. Hence, oper- Layer Layer ating at a faster switching time increases energy-efficiency: a 2× shorter write pulse requires a less than 2× increase MgO RP MgO RAP in write current, and thus, lower write energy [8, 20, 26]. Free Free Unfortunately, the highest switching speed possible with a Layer Layer fixed-size MTJ is restricted by two fundamental factors: (1) the maximum current that the cell can can support during an RAP → RP transition cannot exceed RAP /V dd since the (a) (b) cell has to deliver the necessary switching current over the Figure 1: Illustrative example of a magneto-tunnel junction MTJ in its high-resistance state, and (2) a higher switching (MTJ) in (a) low-resistance parallel and (b) high-resistance current requires the access transistor to be sized larger so anti-parallel states. that it can source the required current, which increases cell area 1 and hurts read energy and delay due to higher gate Applying a small bias voltage (typically 0.1V) across the capacitance. MTJ causes a tunneling current to flow through the MgO Figure 3 shows 1T-1MTJ cell switching time as a func- tunnel barrier without perturbing the magnetic polarity of tion of cell area based on Cadence-Spectre analog circuit the free layer. The magnitude of the tunneling current—and simulations of a single cell at the 32nm technology node, thus, the resistance of the MTJ—is determined by the po- using ITRS 2009 projections on MTJ parameters (Table 1), larity of the two ferromagnetic layers: a lower, parallel resis- 1 tance (RP in Figure 1-a) state is experienced when the spin The MTJ is grown above the source and drain regions of polarities agree, and a higher, antiparallel resistance state the access transistor and is typically much smaller than the is observed when the polarities disagree (RAP in Figure 1- transistor itself; consequently, the size of the access transis- b). When the polarities of the two layers are aligned, elec- tor determines cell area. 372 7 menting hardware structures that are read-only or are sel- Switching Time (ns) 6 dom written. Previous work has explored the possibility of 5 leveraging MRAM to design L2 caches [30,34], but this work 4 is the first to consider the possibility of implementing much of the combinational logic on the chip, as well as microar- 3 chitectural structures such as register files and L1 caches, 2 using STT-MRAM. 1 0 3. FUNDAMENTAL BUILDING BLOCKS 0.0 20.0 40.0 60.0 At a high-level, an STT-MRAM based resistive micropro- Cell Size (F2) cessor consists of storage-oriented resources such as register files, caches, and queues; functional units and other combi- Figure 3: 1T-1MTJ cell switching time as a function of cell national logic elements; and pipeline latches. Judicious par- size based on Cadence-Spectre circuit simulations at 32nm. titioning of these hardware structures between CMOS and STT-MRAM is critical to designing a well-balanced system and the BSIM-4 predictive technology model (PTM) of an that exploits the unique area, speed, and power advantages NMOS transistor [38]; results presented here are assumed of each technology. Making this selection correctly requires in the rest of this paper whenever cell sizing needs to be analyzing two broad categories of MRAM-based hardware optimized for write speed. As the precise value of intrinsic units: those leveraging RAM arrays (queues, register files, current density Jc0 is not included in ITRS projections, Jc0 caches), and those leveraging look-up tables (combinational is conservatively assumed to be zero, which requires a 2× logic, functional units). increase in switching current for a 2× increase in switch- ing speed. If feature size is given by F , then at a switching 3.1 RAM Arrays speed of 6.7ns, a 1T-1MTJ cell occupies 10F 2 area—a 14.6× Large SRAM arrays are commonly organized into hier- density advantage over SRAM, which is a 146F 2 technology. archical structures to optimize for area, speed, and power As the access transistor’s W ratio is increased, its current tradeoffs [3]. An array comprises multiple independent banks L sourcing capability improves, which reduces switching time that can be simultaneously accessed through separate ad- dress and data busses to improve throughput. To minimize to 3.1ns at a cell size of 30F 2 . Increasing the size of the tran- wordline and bitline delays and to simplify decoding com- sistor further causes a large voltage drop across the MTJ, plexity, each bank is further divided into subbanks sharing which reduces the drain-source voltage of the access transis- address and data busses; unlike the case of banks, only a tor and pushes the device into deep triode, and ultimately single subbank can be accessed at a time (Figure 4). A limits its current sourcing capability. As a result, switching subbank consists of multiple independent mats sharing an time reaches an asymptote at 2.6ns, which is accomplished address line, each of which supplies a different portion of at a cell size of 65F 2 . a requested data block on every access. Internally, each mat comprises multiple subarrays. Memory cells within each Parameter Value subarray are organized as rows × columns; a decoder selects Cell Size 10F 2 the cells connected to the relevant wordline, whose contents Switching Current 50µA are driven onto a set of bitlines to be muxed and sensed by Switching Time 6.7ns column sensing circuitry; the sensed value is routed back to Write Energy 0.3pJ/bit MTJ Resistance (RLOW /RHIGH ) 2.5kΩ / 6.25kΩ the data bus of the requesting bank through a separate reply network. Different organizations of a fixed-size RAM array into different numbers of banks, subbanks, mats, and sub- Table 1: STT-MRAM parameters at 32nm based on arrays yield dramatically different area, speed, and power ITRS’09 projections. figures [22]. 2.2 Lookup-Table Based Computing Bank Address Sub-bank Field programmable gate arrays (FPGAs) adopt a ver- Bus satile internal organization that leverages SRAM to store Shared Data and truth tables of logic functions [35]. This not only allows Address Busses a wide variety of logic functions to be flexibly represented, but also allows FPGAs to be re-programmed almost indefi- Data Bus nitely, making them suitable for rapid product prototyping. With technology scaling, FPGAs have gradually evolved Figure 4: Illustrative example of a RAM array organized from four-input SRAM-based truth tables to five- and six- into a hierarchy of banks and subbanks [22]. input tables, named lookup tables (LUT) [7]. This evolution is due to increasing IC integration density—when LUTs are MRAM and SRAM arrays share much of this high-level created with higher numbers of inputs, the area they oc- structure with some important differences arising from the cupy increases exponentially; however, place-and-route be- size of a basic cell, from the loading on bitlines and word- comes significantly easier due to the increased functionality lines, and from the underlying sensing mechanism. In turn, of each LUT. The selection of LUT size is technology de- these differences result in different leakage power, access pendent; for example, Xilinx Virtex-6 FPGAs use both five- energy, delay, and area figures. Since STT-MRAM has a and six-input LUTs, which represent the optimum sizing at smaller cell size than SRAM (10F 2 vs 146F 2 ), the length of the 40nm technology node [35]. the bitlines and wordlines within a subarray can be made This paper leverages an attractive feature of LUT-based shorter, which reduces bitline and wordline capacitance and computing other than reconfigurability: since LUTs are con- resistance, and improves both delay and energy. In addition, structed from memory, it is possible to implement them unlike the case of 6T-SRAM where each cell has two access using a leakage-resistant memory technology such as STT- transistors, a 1T-1MTJ cell has a single access device whose MRAM, for dramatically reduced power consumption. Like size is typically smaller than the SRAM access transistor; other resistive memories, MRAM dissipates near-zero leak- this reduces the amount of gate capacitance on wordlines, as age power in the data array; consequently, power density well as the drain capacitance attached to bitlines, which low- can be kept under check by reducing the supply voltage with ers energy and delay. The smaller cell size of STT-MRAM each new technology generation. (Typical MRAM read volt- implies that subarrays can be made smaller, which shortens ages of 0.1V are reported in the literature [20].) Due to its the global H-tree interconnect that is responsible for a large high write power, the technology is best suited to imple- share of the overall power, area, and delay. Importantly, 373 unlike the case of SRAM where each cell comprises a pair of the array size is reduced. With smaller arrays, opportunities cross-coupled inverters connected to the supply rail, STT- to share sense amplifiers and decoding circuitry across mul- MRAM does not require constant connection to Vdd within tiple rows and multiple columns is significantly lower. One a cell, which reduces leakage power within the data array to option to combat this problem would be to utilize very large virtually zero. arrays to implement lookup tables of logic functions; unfor- tunately, both access time and the area overhead deteriorate Handling Long-Latency Writes. Despite these advan- with larger arrays. tages, STT-MRAM suffers from relatively long write laten- Rather than utilizing an STT-MRAM array to implement cies compared to SRAM (Section 2.1). Leveraging STT- a logic function, we rely on a specialized STT-MRAM based MRAM in designing frequently accessed hardware structures lookup table employing differential current-mode logic (DCML). requires (1) ensuring that critical reads are not delayed by Recent work in this area has resulted in fabricated, two- long-latency writes, and (2) long write latencies do not result input lookup tables [8] at 140nm, as well as a non-volatile in resource conflicts that hamper pipeline throughput. full-adder prototype [26]. Figure 6 depicts an example three- One way of accomplishing both of these goals would be input LUT. The circuit needs both complementary and pure to choose a heavily multi-ported organization for frequently forms of each of its inputs, and the LUT produces comple- written hardware structures. Unfortunately, this results in mentary outputs—when multiple LUTs are cascaded in a an excessive number of ports, and as area and delay grow large circuit, there is no need to generate additional comple- with port count, hurts performance significantly. For exam- mentary outputs. ple, building an STT-MRAM based architectural register file that would support two reads and one write per cycle with Vdd fast, 30F 2 cells at 32nm, 4GHz would require two read ports clk clk and 13 write ports, which would increase total port count clk DEC REF from 3 to 15. An alternative option would be to go to a Z Z SA DEC REF heavily multi-banked implementation without incurring the clk A A A A A overhead of extreme multiporting. Unfortunately, as the A B B 3x8 B B B B B B Tree number of banks are increased, so does the number of H- C C C C C C C C C C C C tree wiring resources, which quickly overrides the leakage clk and area benefits of using STT-MRAM. clk Instead, this paper proposes an alternative strategy that allows high write throughput and read-write bypassing with- Figure 6: Illustrative example of a three-input lookup table. out incurring an increase in the wiring overhead. The key idea is to allow long-latency writes to complete locally within This LUT circuit, an expanded version of what is proposed each sub-bank without unnecessarily occupying global H- in [8], utilizes a dynamic current source by charging and tree wiring resources. To make this possible, each subbank discharging the capacitor shown in Figure 6. The capacitor is augmented with a subbank buffer —an array of flip-flops is discharged during the clk phase, and sinks current through (physically distributed across all mats within a subbank) the 3 × 8 decode tree during the clk phase. Keeper PMOS that latch in the data-in and address bits from the H-tree, transistors charge the two entry nodes of the sense amplifier and continue driving the subarray data and address wires (SA) during the clk phase and sensing is performed during throughout the duration of a write while bank-level wiring the clk phase. These two entry nodes, named DEC and resources are released (Figure 5). In RAM arrays with sep- REF, reach different voltage values during the sensing phase arate read and write ports, subbank buffers drive the write (clk) since the sink paths from DEC to the capacitor vs. port only; reads from other locations within the array can from REF to the capacitor exhibit different resistances. The still complete unobstructed, and it also becomes possible to reference MTJ needs to have a resistance between the low read the value being written to the array directly from the and high resistance values; since ITRS projects RLO and subbank buffer. RHIGH values of 2.5kΩ and 6.25kΩ at 32nm, 4.375kΩ is Sub-bank Subbank chosen for RREF . Buffer Although the MTJ decoding circuitry is connected to Vdd at the top and dynamically connected to GND at the bot- tom, the voltage swing on the capacitor is much smaller than Vdd, which dramatically reduces access energy. The output Shared Data and Address Busses of this current mode logic operation is fed into a sense am- Figure 5: Illustrative example of subbank buffers. plifier, which turns the low-swing operation into a full-swing complementary output. Subbank buffers also make it possible to perform differen- In [8], it is observed that the circuit can be expanded tial writes [18], where only bit positions that differ from their to higher numbers of inputs by expanding the decode tree. original contents are modified on a write. For this to work, However, it is important to note that expanding the tree be- the port attached to the subbank buffer must be designed yond a certain height reduces noise margins and makes the as a read-write port; when a write is received, the subbank LUT circuit vulnerable to process variations, since it be- buffer (physically distributed across the mats) latches in the comes increasingly difficult to detect the difference between new data and initiates a read for the original contents. Once high and low MTJ states due to the additional resistance the data arrives, the original contents and the new contents introduced by the transistors in series. As more and more are bitwise XOR’ed to generate a mask indicating those bit transistors are added, their cumulative resistance can be- positions that need to be changed. This mask is sent to all come comparable to MTJ resistance, and fluctuations among relevant subarrays as the enable signals for the bitline drivers transistor resistances caused by process variations can make along with the actual data—in this way, it becomes possible sensing challenging. to perform differential writes without incurring additional latency and energy on the global H-tree wiring. Differential 3.2.1 Optimal LUT Sizing for Latency, Power, and writes can reduce the number of bit flips, and thus write Area energy, by significant margins, and can make STT-MRAM Both the power and the performance of a resistive proces- based implementation of heavily written arrays practical. sor depend heavily on the LUT sizes chosen to implement combinational logic blocks. This makes it necessary to de- 3.2 Lookup Tables velop a detailed model to evaluate latency, area, and power Although large STT-MRAM arrays dissipate near-zero tradeoffs as a function of STT-MRAM LUT size. Figure 7 leakage power in the subarrays, the leakage power of the pe- depicts read energy, leakage power, read delay, and area as ripheral circuitry can be appreciable and in fact dominant as a function of the number of LUT inputs. LUTs with two to 374 six inputs (4-64 MTJs) are studied, which represent realis- of the lookup table increases exponentially with the number tic LUT sizes for real circuits. As a comparison, only five- of inputs. Every new input doubles the number of transis- and six-input LUTs are utilized in modern FPGAs (e.g., Xil- tors in the branches; as LUT size is increased from two to six inx Virtex 6) as higher sizes do not justify the increase in inputs, the area of the LUT increases fivefold. Nevertheless, latency and area for the marginal gain in flexibility when a single LUT can replace approximately 12 CMOS standard implementing logic functions. As each LUT stores only one cells on average when implementing such complex combina- bit of output, multiple LUTs are accessed in parallel with tional logic blocks as a floating-point unit (Section 4.5) or a the same inputs to produce multi-bit results (e.g., a three-bit memory controller’s scheduling logic (Section 4.6.4); conse- adder that produces a four-bit output). quently, analyses shown later in the paper assume six-input LUTs unless otherwise stated. 2.5 600 3.2.2 Case Study: Three-bit Adder using Static CMOS, Read Energy (fJ) 500 Leakage (pW) 2 400 ROM, and STT-MRAM LUT Circuits 1.5 300 To study the power and performance advantages of STT- 1 200 MRAM LUT-based computing on a realistic circuit, Ta- 0.5 100 ble 2 compares access energy, leakage power, area, and de- 0 0 lay figures obtained on three different implementations of a three-bit adder: (1) a conventional, static CMOS imple- 120 1 mentation, (2) a LUT-based implementation using the STT- 100 0.8 MRAM (DCML) LUTs described in Section 3.2, and (3) Area (um2) Delay (ps) 80 a LUT-based implementation using conventional, CMOS- 0.6 60 based static ROMs. Minimum size transistors are used in all 0.4 40 three cases to keep the comparisons fair. Circuit simulations 20 0.2 are performed using Cadence AMS (Spectre) with Verilog- 0 0 based test vector generation; we use 32nm BSIM-4 predictive 1 2 3 4 5 6 7 1 2 3 4 5 6 7 technology models (PTM) [38] of NMOS and PMOS tran- Number of LUT Inputs Number of LUT Inputs sistors, and the MTJ parameters presented in Table 1 based on ITRS’09 projections. All results are obtained under iden- Figure 7: Access energy, leakage power, read delay, and area tical input vectors, minimum transistor sizing, and a 370K of a single LUT as a function of the number of LUT inputs temperature. Although simulations were also performed at based on Cadence-Spectre circuit simulations at 32nm. 16nm and 22nm nodes, results showed similar tendencies to those presented here, and are not repeated. Read Energy. Access energy decreases slightly as LUT STT-MRAM Static ROM-Based sizes are increased. Although there are more internal nodes— Parameter LUT CMOS LUT and thus, higher gate and drain capacitances–to charge with Delay 100ps 110ps 190ps each access on a larger LUT, the voltage swing on the footer Access Energy 7.43fJ 11.1fJ 27.4fJ Leakage Power 1.77nW 10.18nW 514nW capacitor is lower due to the increased series resistance charg- Area 2.40µm2 0.43µm2 17.9µm2 ing the capacitor. As a design choice, it is possible to size up the transistors in the decode tree to trade off power against latency and area. The overall access energy goes down from Table 2: Comparison of three-bit adder implementations 2fJ to 1.7fJ as LUT size is increased from two to six for the using STT-MRAM LUTs, static CMOS, and a static CMOS minimum-size transistors used in these simulations. ROM. Area estimates do not include wiring overhead. Static CMOS. A three-bit CMOS ripple-carry adder is Leakage Power. Possible dominant leakage paths for the built using one half-adder (HAX1) and two full-adder (FAX1) LUT circuit are: (1) from Vdd through the PMOS keeper circuits based on circuit topologies used in the OSU stan- transistors into the capacitor, (2) from Vdd through the dard cell library [29]. Static CMOS offers the smallest area footer charge/discharge NMOS to GND, and (3) the sense among all three designs considered, since the layout is highly amplifier. Lower values of leakage power are observed at regular and only 70 transistors are required instead of 348, higher LUT sizes due to higher resistance along leakage which is the case of the STT-MRAM LUT-based design. paths (1) and (2), and due to the stack effect of the transis- Leakage is 5.8× higher than MRAM since the CMOS im- tors in the 3 × 8 decode tree. However, similarly to the case plementation has a much higher number of leakage paths of read energy, sizing the decoder transistors appropriately than an STT-MRAM LUT, whose subthreshold leakage is to trade-off speed against energy can change this balance. confined to its peripheral circuitry. As LUT size is increased from two to six inputs, leakage power reduces from 550pW to 400pW. STT-MRAM LUTs. A three-input half-adder requires four STT-MRAM LUTs, one for each output of the adder Latency. Due to the increased series resistance of the de- (three sum bits plus a carry-out bit). Since the least signifi- coder’s pull-down network with larger LUTs, the RC time cant bit of the sum depends only on two bits, it can be calcu- constant associated with charging the footer capacitor goes late using a two-input LUT. Similarly, the second bit of the up, and latency increases from 80 to 100ps. However, LUT sum depends on a total of four bits, and can be implemented speed can be increased by sizing the decoder transistors using a four-input LUT. The most significant bit and the higher, at the expense of larger area, and a higher load ca- carry-out bit each depend on six bits, and each of them re- pacitance for the previous stage driving the LUT. For opti- quires a six-input LUT. Although results presented here are mal results, the footer capacitor must also be sized appropri- based on unoptimized, minimum-size STT-MRAM LUTs, it ately. A higher capacitance allows the circuit to work with a is possible to slow down the two- and four-input LUTs to lower voltage swing at the expense of increased area. Lower save access energy by sizing their transistors. The results capacitance values cause higher voltage swings on the ca- presented here are conservative compared to this best-case pacitor, thereby slowing down the reaction time of the sense optimization scenario. amplifier due to the lower potential difference between the An STT-MRAM based three-bit adder has 1.5× lower ac- DEC and REF nodes. A 50fF capacitor was used in these cess energy than its static CMOS counterpart due to its simulations. energy-efficient, low-swing, differential current-mode logic implementation; however, these energy savings are achieved Area. Although larger LUTs amortize the leakage power of at the expense of a 5.6× increase in area. In a three-bit the peripheral circuitry better, and offer more functionality adder, a six-input STT-MRAM LUT replaces three CMOS without incurring a large latency penalty, the area overhead standard cells. Area overhead can be expected to be lower 375 when implementing more complex logic functions that re- LUT based circuits improve dramatically. This observation quire the realization of many minterms, which is when LUT- that LUT-based implementations work significantly better based computation is most beneficial; for instance, a single for large and complex circuits is one of our guidelines for six-input LUT is expected to replace 12 CMOS standard choosing which parts of a microprocessor should be imple- cells on average when implementing the FPU (Section 4.5) mented using LUTs vs. conventional CMOS. and the memory controller scheduling logic (Section 4.6.4). The most notable advantage of the STT-MRAM LUT over static CMOS is the 5.8× reduction in leakage. This is due 4. STRUCTURE AND OPERATION OF AN to the significantly smaller number of leakage paths that are possible with an STT-MRAM LUT, which exhibits sub- STT-MRAM BASED CMT PIPELINE threshold leakage only through its peripheral circuitry. The Figure 8 shows how hardware resources are partitioned speed of the STT-MRAM LUT is similar to static CMOS: between CMOS and STT-MRAM in an example CMT sys- although CMOS uses higher-speed standard cells, an STT- tem with eight single-issue in-order cores, and eight hard- MRAM LUT calculates all four bits in parallel using inde- ware thread contexts per core. Whether a resource can be pendent LUTs. effectively implemented in STT-MRAM depends on both its size and on the expected number of writes it incurs per CMOS ROM-Based LUTs. To perform a head-on com- cycle. STT-MRAM offers dramatically lower leakage and parison against a LUT-based CMOS adder, we build a 64×4 much higher density than SRAM, but suffers from long write static ROM circuit that can read all three bits of the sum latency and high write energy. Large, wire-delay dominated and the carry-out bit with a single lookup. Compared to RAM arrays—L1 and L2 caches, TLBs, memory controller a 6T-SRAM based, reconfigurable LUT used in an FPGA, queues, and register files—are implemented in STT-MRAM a ROM-based, fixed-function LUT is more energy efficient, to reduce leakage and interconnect power, and to improve since each table entry requires either a single transistor (in interconnect delay. Instruction and store buffers, PC reg- the case of a logic 1) or no transistors at all (in the case isters, and pipeline latches are kept in CMOS due to their of a logic 0), rather than the six transistors required by an small size and relatively high write activity. Since LUTs SRAM cell. A 6-to-64 decoder drives one of 64 wordlines, are never written at runtime, they are used to implement which activates the transistors on cells representing a logic such complex combinational logic blocks as the front-end 1. A minimum sized PMOS pull-up transistor and a skewed thread selection, decode, and next-PC generation logic, the inverter are employed to sense the stored logic value. Four floating-point unit, and the memory controller’s scheduling parallel bitlines are used for the four outputs of the adder, logic. amortizing dynamic energy and leakage power of the decoder An important issue that affects both power and perfor- over four output bits. mance for caches, TLBs, and register files is the size of a ba- The ROM-based LUT dissipates 290× higher leakage than sic STT-MRAM cell used to implement the subarrays. With its STT-MRAM based counterpart. This is due to two fac- 30F 2 cells, write latency can be reduced by 2.2× over 10F 2 tors: (1) transistors in the decoder circuit of the ROM rep- cells (Section 2.1) at the expense of lower density, higher resent a significant source of subthreshold leakage, whereas read energy, and longer read latency. Lookup tables are con- the STT-MRAM LUT uses differential current-mode logic, structed from dense, 10F 2 cells as they are never written at which connects a number of access devices in series with runtime. The register file and the L1 d-cache use 30F 2 cells each MTJ on a decode tree, without any direct connections with 3.1ns switching time as the 6.7ns write occupancy of a between the access devices and Vdd, and (2) the ROM- 10F 2 cell has a prohibitive impact on throughput. The L2 based readout mechanism suffers from significant leakage cache and the memory controller queues are implemented paths within the data array itself, since all unselected de- vices represent sneak paths for active leakage during each with 10F 2 cells and are optimized for density and power access. The access energy of the ROM-based LUT is 3.7× rather than write speed; similarly, TLBs and the L1 i-cache higher than the STT-MRAM LUT, since (1) the decoder are implemented using 10F 2 cells due to their relatively low has to be activated with every access, and (2) the bitlines miss rate, and thus, low write probability. are charged to Vdd and discharged to GND using full-swing voltages, whereas the differential current-sensing mechanism 4.1 Instruction Fetch of the STT-MRAM LUT operates with low-swing voltages. Each core’s front-end is quite typical, with a separate The ROM-based LUT also runs 1.9× slower than its STT- PC register and an eight-deep instruction buffer per thread. MRAM based counterpart due to the serialization of the The i-TLB, i-cache, next-PC generation logic, and front-end decoder access and cell readout: the input signal has to tra- thread selection logic are shared among all eight threads. verse through the decoder to activate one of the wordlines, The i-TLB and the i-cache are built using STT-MRAM ar- which then selects the transistors along that wordline. Two rays; thread selection and next-PC generation logic are im- thirds of the delay is incurred in the decoder. Overall, the plemented with STT-MRAM LUTs. Due to their small size ROM-based LUT delivers the worst results on all metrics and high write activity, instruction buffers and PC registers considered due to its inherently more complex and leakage- are left in CMOS. prone design. 3.2.3 Deciding When to Use LUTs 4.1.1 Program Counter Generation Each thread has a dedicated, CMOS-based PC register. Consider a three-bit adder which has two three-bit inputs To compute the next sequential PC with minimum power and four one-bit outputs. This function can be implemented and area overhead, a special 6 × 7 “add one” LUT is used using four six-input LUTs, whereas the VLSI implementa- rather than a general-purpose adder LUT. A 6 × 7 LUT tion requires only three standard cells, resulting in a stdcell LU T accepts six bits of the current PC plus a carry-in bit to ratio of less than one. On the other hand, an unsigned multi- calculate the corresponding six bits of the next PC and a plier with two three-bit inputs and a six-bit output requires carry-out bit; internally, the circuit consists of two-, three-, six six-input LUTs or 36 standard cells, raising the same four-, five-, and six-input LUTs (one of each), each of which ratio to six. As the size and complexity of a Boolean func- computes a different bit of the seven bit output in parallel. tion increases, thereby requiring more minterms after logic The overall next sequential PC computation unit com- minimization, this ratio can be as high as 12 [5]. This is due prises five such 6 × 7 LUTs arranged in a carry-select con- not only to the increased complexity of the function better figuration (Figure 9). Carry out bits are used as the select utilizing the fixed size of the LUTs, but also to the sheer signals for a chain of CMOS-based multiplexers that choose size of the circuit allowing the boolean minimizer to amor- either the new or the original six bits of the PC. Hence, the tize complex functions over multiple LUTs. As this ratio delay of the PC generation logic is four multiplexer delays, gets higher, power consumption and leakage advantage of plus a single six-input LUT delay, which comfortably fits 376 Write Back Memory Pure CMOS STT-MRAM LUTs RegFile STT-MRAM Arrays x8 CLK Instruction Pre Thread CLK Fetch Decode Select Func Unit Thrd I$ Inst Thrd ALU D$ MC0 Queue PC Decode Sel Buf Sel StBuf MC0 Logic Logic Logic Crossbar Mux I-TLB x8 Mux FPU x8 D-TLB Interface MC1 Queue CLK CLK CLK CLK CLK CLK Shared MC1 Logic L2$ • I$ Miss • D$ Miss MC2 Queue Front-End Back-End Banks x 8 Thrd Sel • I-TLB Miss Thrd Sel • D-TLB Miss MC2 Logic Logic • Inst Buf Full Logic • Dependence MC3 Queue • Branch • Structure Conflict MC3 Logic Decode Execute Figure 8: Illustrative example of a resistive CMT pipeline. within a 250ps clock period in PTM-based circuit simula- ately if the write port is available; otherwise, it is placed in tions (Section 6). the refill queue while it waits for the write port to free up. 31 31 26 26 20 20 14 14 88 2 2 SRAM STT-MRAM STT-MRAM PC Parameter (32KB) (32KB) (128KB) Read Delay 397ps 238ps 474ps Write Delay 397ps 6932ps 7036ps 6x7 LUT-64 6x7 LUT-64 6x7 LUT-64 6x7 LUT-64 6x7 LUT-64 Read Energy 35pJ 13pJ 50pJ LUT LUT LUT LUT LUT Write Energy 35pJ 90pJ 127pJ Leakage Power 75.7mW 6.6mW 41.4mW cout Area 0.31mm2 0.06mm2 0.26mm2 cout cout cout cout cout cout cout Next PC Table 3: Instruction cache parameters. Figure 9: Next PC generation using five add-one LUTS in It is possible to leverage the 14.6× density advantage a carry-select configuration. of STT-MRAM over SRAM by either designing a similar- capacity L1 i-cache with shorter wire delays, lower read en- ergy, and lower area and leakage, or by designing a higher- 4.1.2 Front-End Thread Selection capacity cache with similar read latency and read energy Every cycle, the front-end selects one of the available under a similar area budget. Table 3 presents latency, power, threads to fetch in round-robin order, which promotes fair- and area comparisons between a 32KB, SRAM-based i-cache; ness and facilitates a simple implementation. The following its 32KB, STT-MRAM counterpart; and a larger, 128KB conditions make a thread unselectable in the front-end: (1) STT-MRAM configuration that fits under the same area an i-cache or an i-TLB miss, (2) a full instruction buffer, or budget 2 . Simply migrating the 32KB i-cache from SRAM (3) a branch or jump instruction. On an i-cache or an i-TLB to STT-MRAM reduces area by 5.2×, leakage by 11.5×, miss, the thread is marked unselectable for fetch, and is re- read energy by 2.7×, and read delay by one cycle at 4GHz. set to a selectable state when the refill of the i-cache or the Leveraging the density advantage to build a larger, 128KB i-TLB is complete. To facilitate front-end thread selection, cache results in more modest savings in leakage (45%) due the ID of the last selected thread is kept in a three-bit CMOS to the higher overhead of the CMOS-based peripheral cir- register, and the next thread to fetch from is determined as cuitry. Write energy increases by 2.6 − 3.6× over CMOS the next available, ublocked thread in round-robin order. with 32KB and 128KB STT-MRAM caches, respectively. The complete thread selection mechanism thus requires an 11-to-3 LUT, which is built from 96 six-input LUTs sharing 4.2 Predecode a data bus with tri-state buffers—six bits of the input are After fetch, instructions go through a predecode stage sent to all LUTs, and the remaining five bits are used to where a set of predecode bits for back-end thread selection generate the enable signals for all LUTs in parallel with the are extracted and written into the CMOS-based instruction LUT access. (It is also possible to optimize for power by buffer. Predecode bits indicate if the instruction is a mem- serializing the decoding of the five bits with the LUT ac- ber of the following equivalence classes: (1) a load or a store, cess, and by using the enable signal to control the LUT clk (2) a floating-point or integer divide, (3) a floating-point input.) add/sub, compare, multiply, or an integer multiply, (4) a brach or a jump, or (5) any other ALU operations. Each 4.1.3 L1 Instruction Cache and TLB flag is generated by inspecting the six-bit opcode, which The i-cache and and the i-TLB are both implemented in requires a total of five six-input LUTs. The subbank ID STT-MRAM due to their large size and relatively low write of the destination register is also extracted and recorded in activity. Since writes are infrequent, these resources are each the instruction buffer during the predecode stage to faciliate organized into a single subbank to minimize the overhead of back-end thread selection. the peripheral circuitry, and are built using 10F 2 cells that reduce area, read energy, and read latency at the expense of 4.3 Thread Select longer writes. The i-cache is designed with a dedicated read Every cycle, the back-end thread selection unit issues an port and a dedicated write port to ensure that the front-end instruction from one of the available, unblocked threads. does not come to a complete stall during refills; this ensures The goal is to derive a correct and balanced issue sched- that threads can still fetch from the read port in the shadow ule that prevents out-of-order completion; avoids structural of an ongoing write. To accommodate multiple outstanding hazards and conflicts on L1 d-cache and register file sub- misses from different threads, the i-cache is augmented with banks; maintains fairness; and delivers high throughput. an eight-entry refill queue. When a block returns from the 2 L2 on an i-cache miss, it starts writing to the cache immedi- The experimental setup is described in Section 5. 377 4.3.1 Instruction Buffer too much area, leakage, or latency overhead (Section 3.1). Each thread has a private, eight-deep instruction buffer Mapping each thread’s integer and floating-point registers organized as a FIFO queue. Since buffers are small and are to a common subbank would significantly degrade through- written every few cycles with up to four new instructions, put when a single thread is running in the system, or during they are implemented in CMOS as opposed to STT-MRAM. periods where only a few threads are schedulable due to L1 d-cache misses. To avert this problem, each thread’s regis- 4.3.2 Back-End Thread Selection Logic ters are are striped across consecutive subbanks to improve Every cycle, back-end thread selection logic issues the in- throughput and to minimize the chance of a subbank write struction at the head of one of the instruction buffers to be port conflict. Double-precision floating-point operations re- decoded and executed. The following events make a thread quire reading two consecutive floating-point registers start- unschedulable: (1) an L1 d-cache or d-TLB miss, (2) a struc- ing with an even-numbered register, which is accomplished tural hazard on a register file subbank, (3) a store buffer by accessing two consecutive subbanks and driving the 64- overflow, (4) a data dependency on an ongoing long-latency bit data bus in parallel. floating-point, integer multiply, or integer divide instruction, T0-R0 T0-R1 T0-R2 T0-R3 (5) a structural hazard on the (unpipelined) floating-point T1-R0 T1-R1 T1-R2 T1-R3 divider, and (6) the possibility of out-of-order completion. A load’s buffer entry is not recycled at the time the load issues; instead, the entry is retained until the load is known T0-R4 T0-R5 T0-R6 T0-R7 T1-R4 T1-R5 T1-R6 T1-R7 to hit in the L1 d-cache or in the store buffer. In the case of a miss, the thread is marked as unschedulable; when the L1 d-cache refill process starts, the thread transitions to a schedulable state, and the load is replayed from the instruc- Shared Data and Address Busses tion buffer. On a hit, the load’s instruction buffer entry is recycled as soon as the load enters the writeback stage. Long-latency floating-point instructions and integer mul- Figure 10: Illustrative example of a subbanked register file. tiplies from a single thread can be scheduled back-to-back so Table 4 lists area, read energy, and leakage power ad- long as there are no dependencies between them. In the case vantages that are possible by implementing the register file of an out-of-order completion possibility—a floating-point in STT-MRAM. The STT-MRAM implementation reduces divide followed by any other instruction, or any floating- leakage by 2.4× and read energy by 1.4× over CMOS; how- point instruction other than a divide followed by an integer ever, energy for a full 32-bit write is increased by 22.2×. instruction—, the offending thread is made unschedulable Whether the end result turns out to be a net power savings for as many cycles as needed for the danger to disappear. depends on how frequently the register file is updated, and Threads can also become unschedulable due to structural on how effective differential writes are on a given workload. hazards on the unpipelined floating-point divider, on reg- ister file subbank write ports, or on store buffers. As the Parameter SRAM STT-MRAM register file is built using 30F 2 STT-MRAM cells with 3.1ns Read Delay 137ps 122ps switching time, the register file subbank write occupancy is Write Delay 137ps 3231ps 13 cycles at 4GHz. Throughout the duration of an on-going Read Energy 0.45pJ 0.33pJ write, the subbank is unavailable for a new write (unless Write Energy 0.45pJ 10.0pJ Leakage Power 3.71mW 1.53mW it is the same register that is being overwritten), but the Area 0.038mm2 0.042mm2 read ports remain available; hence, register file reads are not stalled by long-latency writes. If the destination sub- bank of an instruction conflicts with an ongoing write to Table 4: Register file parameters. the same bank, the thread becomes unschedulable until the 4.5 Execute target subbank is available. If the head of the instruction After decode, instructions are sent to functional units to buffer is a store and the store buffer of the thread is full, the complete their execution. Bitwise logical operations, inte- thread becomes unschedulable until there is an opening in ger addition and subtraction, and logical shifts are handled the store buffer. by the integer ALU, whereas floating-point addition, mul- In order to avoid starvation, a least recently selected (LRS) tiplication, and division are handled by the floating-point policy is used to pick among all schedulable threads. The unit. Similar to Sun’s Niagara-1 processor [17], integer mul- LRS policy is implemented using CMOS gates. tiply and divide operations are also sent to the FPU rather than a dedicated integer multiplier to save area and leakage 4.4 Decode power. Although the integer ALU is responsible for 5% of In the decode stage, the six-bit opcode of the instruction the baseline leakage power consumption, many of the opera- is inspected to generate internal control signals for the fol- tions it supports (e.g., bitwise logical operations) do not have lowing stages of the pipeline, and the architectural register enough circuit complexity (i.e., minterms) to amortize the file is accessed to read the input operands. Every decoded peripheral circuitry in a LUT-based implementation. More- signal propagated to the execution stage thus requires a six- over, operating an STT-MRAM based integer adder (the input LUT. For a typical, five-stage MIPS pipeline [15] with 16 output control signals, 16 six-input LUTs suffice to ac- power- and area-limiting unit in a typical integer ALU [28]) complish this. at single-cycle throughput requires the adder to be pipelined in two stages, but the additional power overhead of the 4.4.1 Register File pipeline flip-flops largely offsets the benefits of transition- ing to STT-MRAM. Consequently, the integer ALU is left Every thread has 32 integer registers and 32 floating-point in CMOS. The FPU, on the other hand, is responsible for registers, for a total of 512 registers (2kB of storage) per core. a large fraction of the per-core leakage power and dynamic To enable a high-performance, low-leakage, STT-MRAM access energy, and is thus implemented with STT-MRAM based register file that can deliver the necessary write through- LUTs. put and single-thread latency, integer and floating-point reg- ister from all threads are aggregated in a subbanked STT- Floating-Point Unit. To compare ASIC- and LUT-based MRAM array as shown in Figure 10. The overall register implementations of the floating-point unit, an industrial FPU file consists of 32 subbanks of 16 registers each, sharing a design from Gaisler Research, the GRFPU [5], is taken as a common address bus and a 64-bit data bus. The register baseline. A VHDL implementation of the GRFPU synthe- file has two read ports and a write port, and the write ports sizes to 100,000 gates on an ASIC design flow, and runs at are augmented with subbank buffers to allow multiple writes 250MHz at 130nm; on a Xilinx Virtex-2 FPGA, the unit syn- to proceed in parallel on different subbanks without adding thesizes to 8,500 LUTs, and runs at 65MHz. Floating-point 378 addition, subtraction, and multiplication are fully pipelined the scheduling of stores and to minimize the performance and execute with a three-cycle latency; floating-point divi- impact of contention on subbank write ports, each thread sion is unpipelined and takes 16 cycles. is allocated a CMOS-based, eight-deep store buffer holding To estimate the required pipeline depth for an STT-MRAM in-flight store instructions. LUT-based implementation of the GRFPU to operate at 4GHz at 32nm, we use published numbers on configurable 4.6.1 Store Buffers logic block (CLB) delays on a Virtex-2 FPGA [2]. A CLB One problem that comes up when scheduling stores is the has a LUT+MUX delay of 630ps and an interconnect delay possibility of a d-cache subbank conflict at the time the of 1 to 2ns based on its placement, which corresponds to a store reaches the memory stage. Since stores require address critical path of six to ten CLB delays. For STT-MRAM, we computation before their target d-cache subbank is known, assume a critical path delay of eight LUTs, which represents thread selection logic cannot determine if a store will expe- the average of these two extremes. Assuming a buffered rience a port conflict in advance. To address this problem, six-input STT-MRAM LUT delay of 130ps and a flip-flop the memory stage of the pipeline includes a CMOS-based, sequencing overhead (tsetup + tC→Q ) of 50ps, and conserva- private, eight-deep store buffer per thread. So long as a tively assuming a perfectly-balanced pipeline for the base- thread’s store buffer is not full, the thread selection logic can line GRFPU, we estimate that the STT-MRAM implemen- schedule the store without knowing the destination subbank. tation would need to be pipelined eight times deeper than Stores are dispatched into and issued from store buffers in the original to operate at 4GHz, with floating-point addi- FIFO order; store buffers also provide an associative search tion, subtraction, and multiplication latencies of 24 cycles, port to support store-to-load forwarding, similar to Sun’s and an unpipelined, 64-cycle floating-point divide latency. Niagara-1 processor. We assume relaxed consistency mod- When calculating leakage power, area, and access energy, els where special synchronization primitives (e.g., memory we account for the overhead of the increased number of flip- fences in weak consistency, or acquire/release operations in flops due to this deeper pipeline (flip-flop power, area, and release consistency) are inserted into store buffers, and the speed are extracted from 32nm circuit simulations of the store buffer enforces the semantics of the primitives when topology used in the OSU standard cell library [29]). We retiring stores and when forwarding to loads. Since the L1 characterize and account for the impact of loading on an d-cache supports a single write port (but multiple subbank STT-MRAM LUT when driving another LUT stage or a flip-flop via Cadence-Spectre circuit simulations. buffers), only a single store can issue per cycle. Store buffers, To estimate pipeline depth for the CMOS implementation and the L1 refill queue contend for access to this shared re- of the GRFPU running at 4GHz, we first scale the baseline source, and priority is determined based on a round-robin 250MHz frequency linearly from 130nm to 32nm, which cor- policy. responds to an intrinsic frequency of 1GHz at 32nm. Thus, conservatively ignoring the sequencing overhead, to operate 4.6.2 L1 Data Cache and TLB at 4GHz, the circuit needs to be pipelined 4× deeper, with Both the L1 d-cache and the d-TLB are implemented us- 12-cycle floating-point addition, subtraction, and multipli- ing STT-MRAM arrays. The d-cache is equipped with two cation latencies, and a 64-cycle, unpipelined floating-point read ports (one for snooping, and one for the core) and a divide. Estimating power for CMOS (100,000 gates) re- write port shared among all subbanks. At the time a load is- quires estimating dynamic and leakage power for an average sues, the corresponding thread is marked unschedulable and gate in a standard-cell library. We characterize the following recycling of the instruction buffer entry holding the load is OSU standard cells using circuit simulations at 32nm, and postponed until it is ascertained that the load will not expe- use their average to estimate power for the CMOS-based rience a d-cache miss. Loads search the store buffer of the GRFPU design: INVX2, NAND2X1, NAND3X1, BUFX2, corresponding thread and access the L1 d-cache in parallel, BUFX4, AOI22X1, MUX2X1, DFFPOSX1, and XNORX1. and forward from the store buffer in the case of a hit. On Table 5 shows the estimated leakage, dynamic energy, and a d-cache miss, the thread is marked unschedulable, and is area of the GRFPU in both pure CMOS and STT-MRAM. transitioned back to a schedulable state once the data ar- The CMOS implementation uses 100, 000 gates whereas the rives. To accommodate refills returning from the L2, the STT-MRAM implementation uses 8,500 LUTs. Although L1 has a 16-deep, CMOS-based refill queue holding incom- each CMOS gate has lower dynamic energy than a six-input ing data blocks. Store buffers and the refill queue contend LUT, each LUT can replace 12 logic gates on average. This for access to the two subbanks of the L1, and are given 12× reduction in unit count results in an overall reduction access using a round-robin policy. Since the L1 is written of the total dynamic energy. Similarly, although each LUT frequently, it is optimized for write throughput using 10F 2 has higher leakage than a CMOS gate, the cumulative leak- cells. L1 subbank buffers perform internal differential writes age of 8,500 LUTs reduces leakage by 4× over the combined to reduce write energy. leakage of 100, 000 gates. Area, on the other hand, is com- parable due to the reduced unit count compensating for the SRAM STT-MRAM STT-MRAM 5× higher area of each LUT and the additional buffering Parameter (32KB) (32KB, 30F 2 ) (64KB, 30F 2 ) required to cascade the LUTs. (Note that these area es- Read Delay 344ps 236ps 369ps Write Delay 344ps 3331ps 3399ps timates do not account for wiring overheads in either the Read Energy 60pJ 31pJ 53pJ CMOS or the STT-MRAM implementations.) In summary, Write Energy 60pJ 109pJ 131pJ the FPU is a good candidate to place in STT-MRAM since Leakage Power 78.4mW 11.0mW 31.3mW its high circuit complexity produces logic functions with Area 0.54mm2 0.19mm2 0.39mm2 many minterms that require many CMOS gates to imple- ment, which is exactly when a LUT-based implementation Table 6: L1 d-cache parameters. is advantageous. Parameter CMOS FPU STT-MRAM FPU Table 6 compares the power, area, and latency character- Dynamic Energy 36pJ 26.7pJ istics of two different STT-MRAM based L1 configurations Leakage Power 259mW 61mW to a baseline, 32KB CMOS implementation. A capacity- Area 0.22mm2 0.20mm2 equivalent, 32KB d-cache reduces access latency from two clock cycles to one, and cuts down the read energy by 1.9× Table 5: FPU parameters. Area estimates do not include due to shorter interconnect lengths possible with the den- wiring overhead. sity advantage of STT-MRAM. Leakage power is reduced by 7.1×, and area is reduced by 2.8×. An alternative, 64kB configuration requires 72% of the area of the CMOS base- 4.6 Memory line, but increases capacity by 2×; this configuration takes In the memory stage, load and store instructions access two cycles to read, and delivers a 2.5× leakage reduction the STT-MRAM based L1 d-cache and d-TLB. To simplify over CMOS. 379 4.6.3 L2 Cache ister file subbank conflicts into account. Differential writes The L2 cache is designed using 10F 2 STT-MRAM cells within the register file reduce write power during write backs. to optimize for density and access energy rather than write speed. To ensure adequate throughput, the cache is equipped with eight banks, each of which supports four subbanks, for 5. EXPERIMENTAL SETUP a total of 32. Each L2 bank has a single read/write port We use a heavily modified version of the SESC simula- shared among all subbanks; unlike the L1 d-cache and the tor [25] to model a Niagara-like in-order CMT system with register file, L2 subbanks are not equipped with differen- eight cores, and eight hardware thread contexts per core. tial writing circuitry to minimize leakage due to the CMOS- Table 9 lists the microarchitectural configuration of the base- based periphery. line cores and the shared memory subsystem. Table 7 compares two different STT-MRAM L2 organiza- tions to a baseline, 4MB CMOS L2. To optimize for leak- Processor Parameters age, the baseline CMOS L2 cache uses high-Vt transistors Frequency 4 GHz Number of cores 8 in the data array, whereas the peripheral circuitry needs Number of SMT contexts 8 per core to be implemented using low-Vt, high-performance transis- Front-end thread select Round Robin tors to maintain a 4GHz cycle time. A capacity-equivalent, Back-end thread select Least Recently Selected 4MB STT-MRAM based L2 reduces leakage by 2.0× and Pipeline organization Single-issure, in-order Store buffer entries 8 per thread read access energy by 63% compared to a CMOS baseline. L1 Caches Alternatively, it is possible to increase capacity to 32MB iL1/dL1 size 32kB/32kB while maintaining lower area, but the leakage overhead of iL1/dL1 block size 32B/32B the peripheral circuitry increases with capacity, and results iL1/dL1 round-trip latency 2/2 cycles(uncontended) in twice as much leakage as the baseline. iL1/dL1 ports 1/2 iL1/dL1 banks 1/2 iL1/dL1 MSHR entries 16/16 SRAM STT-MRAM STT-MRAM iL1/dL1 associativity direct mapped/2-way Parameter (4MB) (4MB) (32MB) Coherence protocol MESI Read Delay 2364ps 1956ps 2760ps Consistency model Release consistency Write Delay 2364ps 7752ps 8387ps Shared L2 Cache and Main Memory Read Energy 1268pJ 798pJ 1322pJ Shared L2 cache 4MB, 64B block, 8-way Write Energy 1268pJ 952pJ 1477pJ L2 MSHR entries 64 Leakage Power 6578mW 3343mW 12489mW L2 round-trip latency 10 cycles (uncontended) Area 82.33mm2 32.00mm2 70.45mm2 Write buffer 64 entries DRAM subsystem DDR2-800 SDRAM [21] Memory controllers 4 Table 7: L2 cache parameters. Table 9: Parameters of baseline. 4.6.4 Memory Controllers To provide adequate memory bandwidth to eight cores, For STT-MRAM, we experiment with two different design the system is equipped with four DDR2-800 memory con- points for L1 and L2 caches: (1) configurations with capac- trollers. Memory controller read and write queues are im- ity equivalent to the CMOS baseline, where STT-MRAM plemented in STT-MRAM using 10F 2 cells. Since the con- enjoys the benefits of lower interconnect delays (Table 10- troller needs to make decisions only every DRAM clock cycle Small), and (2) configurations with larger capacity that still (10 processor cycles in our baseline), the impact of write la- fit in under same area budget as the CMOS baseline, where tency on scheduling efficiency and performance is negligible. STT-MRAM benefits from fewer misses (Table 10-Large). The controller’s scheduling logic is implemented using STT- STT-MRAM memory controller queue write delay is set to MRAM LUTs. To estimate power, performance, and area 27 processor cycles. We experiment with an MRAM-based under CMOS- and MRAM-based implementations, we use a register file with 32 subbanks and a write delay of 13 cy- methodology similar to that employed for the floating-point cles each, and we also evaluate the possibility of leaving the unit. We use a DDR2-800 memory controller IP core de- register file in CMOS. veloped by HiTech [11] as our baseline; on an ASIC design To derive latency, power, and area figures for STT-MRAM flow, the controller synthesizes to 13, 700 gates and runs at arrays, we use a modified version of CACTI 6.5 [23] aug- 400MHz; on a Xilinx Virtex-5 FPGA, the same controller mented with 10F 2 and 30F 2 STT-MRAM cell models. We synthesizes to 920 CLBs and runs at 333MHz. Replacing use BSIM-4 predictive technology models (PTM) of NMOS CLB delays with STT-MRAM LUT delays, we find that an and PMOS transistors at 32nm, and perform circuit simu- STT-MRAM based implementation of the controller would lations using Cadence AMS (Spectre) mixed signal analyses meet the 400MHz cycle time without further modifications. with Verilog-based input test vectors. Only high perfor- Table 8 compares the parameters of the CMOS and STT- mance transistors were used in all circuit simulations. Tem- MRAM based implementations. Similarly to the case of the perature is set to 370K in all cases, which is a meaningful FPU, the controller logic benefits significantly from a LUT thermal design point for the proposed processor operating based design. Leakage power is reduced by 7.2×, while the at 4GHz [24]. energy of writing to the scheduling queue increases by 24.4×. For structures that reside in CMOS in both the baseline and the proposed architecture (e.g., pipeline latches, store Parameter CMOS STT-MRAM buffers), McPAT [19] is used to estimate power, area, and Read Delay 185ps 154ps latency. Write Delay 185ps 6830ps Read Energy 7.1pJ 5.6pJ Write Energy MC Logic Energy 7.1pJ 30.0pJ 173pJ 1.6pJ 6. EVALUATION Leakage Power 41.4mW 5.72mW Area 0.097mm2 0.051mm2 6.1 Performance Figure 11 compares the performance of four different MRAM- Table 8: Memory controller parameters. Area estimates do based CMT configurations to the CMOS baseline. When not include wiring overhead. the register file is placed in STT-MRAM, and the L1 and L2 cache capacities are made equivalent to CMOS, perfor- mance degrades by 11%. Moving the register file to CMOS 4.7 Write Back improves performance, at which point the system achieves In the write-back stage, an instruction writes its result 93% of the baseline performance. Enlarging both L1 and L2 back into the architectural register file through the write cache capacities under the same area budget reduces miss port. No conflicts are possible during this stage since the rates but loses the latency advantage of the smaller caches; thread selection logic schedules instructions by taking reg- this configuration outperforms CMOS by 2% on average. 380 Performance Normalized 1.2 1 to CMOS 0.8 0.6 0.4 0.2 0 BLAST BSOM CG CHOLESKY EQUAKE FFT KMEANS LU MG OCEAN RADIX SWIM WATER‐N GEOMEAN CMOS Small L1&L2, STT‐MRAM RF Small L1&L2, CMOS RF Large L1&L2, CMOS RF Small L1, Large L2, CMOS RF Figure 11: Performance Total Power (W) 20.0 15.0 10.0 5.0 0.0 BLAST BSOM CG CHOLESKY EQUAKE FFT KMEANS LU MG OCEAN RADIX SWIM WATER‐N AVERAGE CMOS Small L1&L2, STT‐MRAM RF Small L1&L2, CMOS RF Large L1&L2, CMOS RF Small L1, Large L2, CMOS RF Figure 12: Total Power Small Large iL1/dL1 size 32kB/32kB 128kB/64kB systems. STT-MRAM configurations that maintain the same iL1/dL1 latency 1/1 cycles 2/2 cycles cache sizes as CMOS reduce total power by 1.7× over CMOS. L1s write occupancy 13 cycles 13 cycles Despite their higher performance potential, configurations L2 size 4MB 32MB which increase cache capacity under the same area budget L2 latency 8 cycles 12 cycles L2 write occupancy 24 cycles 23 cycles increase power by 1.2× over CMOS, due to the significant amount of leakage power dissipated in the CMOS-based de- coding and sensing circuitry in the 32MB L2 cache. Al- Table 10: STT-MRAM caches parameters though a larger L2 can reduce write power by allowing for fewer L2 refills and writes to memory controllers’ scheduling Benchmark Description Problem size queues, the increased leakage power consumed by the pe- Data Mining ripheral circuitry outweighs the savings on dynamic power. BLAST Protein matching 12.3k sequences Figure 13 shows the breakdown of leakage power across BSOM Self-organizing map 2,048 rec., 100 epochs KMEANS K-means clustering 18k pts., 18 attributes different components for all evaluated systems. Total leak- NAS OpenMP age power is reduced by 2.1× over CMOS when cache ca- MG Multigrid Solver Class A pacities are kept the same. Systems with a large L2 cache CG Conjugate Gradient Class A increase leakage power by 1.3× due to the CMOS-based pe- SPEC OpenMP riphery. The floating-point units, which consume 18% of the SWIM Shallow water model MinneSpec-Large total leakage power in the CMOS baseline, benefit signifi- EQUAKE Earthquake model MinneSpec-Large cantly from an STT-MRAM based implementation. STT- Splash-2 Kernels MRAM based L1 caches and TLBs together reduce leakage CHOLESKY Cholesky factorization tk29.O FFT Fast Fourier transform 1M points power by another 10%. The leakage power of the memory LU Dense matrix division 512 × 512to16 × 16 controllers in STT-MRAM is negligible, whereas in CMOS RADIX Integer radix sort 2M integers it is 1.5%. Splash-2 Applications OCEAN Ocean movements 514×514 ocean 14.92 WATER-N Water-Nsquared 512 molecules 14.48 RF 14 FPU Table 11: Simulated applications and their input sizes. 12 11.40 Leakage Power (W) ALU and Bypass 10 Optimizing the L2 for fewer misses (by increasing capacity 8 InstBuf and STQ under a the same area budget) while optimizing the L1s for FFs and Comb Logic 6 fast hits (by migrating to a denser, STT-MRAM cache with 5.32 5.34 L1s and TLBs same capacity) delivers similar results. 4 In general, performance bottlenecks are application de- 2 L2 pendent. For applications like CG, FFT and WATER, the MC MRAM-based register file represents the biggest performance 0 hurdle. These applications encounter a higher number of CMOS Small L1 and Small L1 and Large L1 and Small L1, Large L2, STT‐MRAM L2, CMOS RF L2, CMOS RF L2, CMOS RF subbank conflicts than others, and when the register file is RF moved to CMOS, their performance improves significantly. Figure 13: Leakage Power. EQUAKE, KMEANS, MG, and RADIX are found sensi- tive to floating-point instruction latencies as they encounter many stalls due to dependents of long-latency floating-point 7. RELATED WORK instructions in the 24-cycle, STT-MRAM based floating- STT-MRAM has received increasing attention in recent point pipeline. CG, CHOLESKY, FFT, RADIX, and SWIM years at the device and circuit levels [8, 12, 26, 32, 33, 37, 39]. benefit most from increasing cache capacities under the same At the architecture level, Desikan et al. [9] explore using area budget as CMOS, by leveraging the density advantage MRAM as a DRAM replacement to improve memory band- of STT-MRAM. width and latency. Dong et al. [10] explore 3D-stacked MRAM, and propose a model to estimate the power and area 6.2 Power of MRAM arrays. Sun et al. [30] present a read-preemptive Figure 12 compares total power dissipation across all five write technique which allows an SRAM-MRAM hybrid L2 381 cache to get performance improvements and power reduc- [16] U. R. Karpuzcu, B. Greskamp, and J. Torellas. The tions. Zhou et al. [40] apply an early write termination bubblewrap many-core: Popping cores for sequential acceleration. In International Symposium on technique at the circuit level to reduce STT-MRAM write Microarchitecutre, 2009. energy. Wu et al. [34] propose a data migration scheme to [17] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A a hybrid cache architecture to reduce the number of writes 32-way multithreaded sparc processor. IEEE Micro, to resistive memories. Xu et al. [36] propose a circuit tech- 25(2):21–29, 2005. nique, which sizes transistors smaller than the worst case [18] B. Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting phase-change memory as a scalable dram alternative. In size required to generate the switching current to improve International Symposium on Computer Architecture, Austin, density. Most of this earlier work on MRAM considers it TX, June 2009. as a DRAM or SRAM cache replacement in the system and [19] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, none of them discusses how to use resistive memories to build and N. P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore combinational logic. architectures. In International Symposium on Computer Architecture, 2009. [20] M. Hosomi and H. Yamagishi and T. Yamamoto and K. Bessha 8. CONCLUSIONS et al. A novel nonvolatile memory with spin torque transfer In this paper, we have presented a new technique that re- magnetization switching: Spin-RAM. In IEDM Technical duces leakage and dynamic power in a deep-submicron mi- Digest, pages 459–462, 2005. [21] Micron. 512Mb DDR2 SDRAM Component Data Sheet: croprocessor by migrating power- and performance-critical MT47H128M4B6-25, March 2006. http://download.micron. hardware resources from CMOS to STT-MRAM. We have com/pdf/datasheets/dram/ddr2/512MbDDR2.pdf. evaluated the power and performance impact of implement- [22] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. ing on-chip caches, register files, memory controllers, floating- Optimizing NUCA organizations and wiring alternatives for point units, and various combinational logic blocks using large caches with CACTI 6.0. Chicago, IL, Dec. 2007. [23] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi. magnetoresistive circuits, and we have explored the critical NAS parallel benchmarks. Technical report, NASA Ames issues that affect whether a RAM array or a combinational Research Center, March 1994. Tech. Rep. RNR-94-007. logic block can be effectively implemented in MRAM. We [24] U. G. Nawathe, M. Hassan, K. C. Yen, A. Kumar, have observed significant gains in power-efficiency by par- A. Ramachandran, and D. Greenhill. Implementation of an 8-core, 64-thread, power-efficient sparc server on a chip. IEEE titioning on-chip hardware resources among STT-MRAM Journal of Solid-State Circuits, 43(1):6–20, January 2008. and CMOS judiciously to exploit the unique power, area, [25] J. Renau, B. Fraguela, J. Tuck, W. Liu, M. Prvulovic, L. Ceze, and speed benefits of each technology, and by carefully re- S. Sarangi, P. Sack, K. Strauss, and P. Montesinos. SESC architecting the pipeline to mitigate the performance impact simulator, January 2005. http://sesc.sourceforge.net. of long write latencies and high write power. [26] S. Matsunaga et al. Fabrication of a nonvolatile full adder based on logic-in-memory architecture using magnetic tunnel We believe this paper is part of an exciting new trend to- junctions. Applied Physics Express, 1(9), 2008. ward leveraging resistive memories in effecting a significant [27] S. Rusu et al. A 45nm 8-Core Enterprise Xeon Processor. In leap in the performance and efficiency of computer systems. Proceedings of the IEEE International Solid-State Circuits Conference, pages 56–57, Feb. 2009. [28] Sanu K. Mathew and Mark A. Anders and Brad Bloechel et al. 9. ACKNOWLEDGMENTS A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS. IEEE Journal of Solid-State The authors would like to thank Yanwei Song, Ravi Patel, Circuits, 40(1):44–51, January 2005. Sheng Li, and Todd Austin for useful feedback. [29] J. E. Stine, I. Castellanos, M. Wood, J. Henson, and F. Love. Freepdk: An open-source variation-aware design kit. In International Conference on Microelectronic Systems Education, 2007. 10. REFERENCES http://vcag.ecen.okstate.edu/projects/scells/. [1] V. Agarwal, M. Hrishikesh, S. Keckler, and D. Burger. Clock [30] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen. A novel 3D rate vs. IPC: End of the road for conventional microprocessors. stacked MRAM cache architecture for CMPs. In In International Symposium on Computer Architecture, High-Performance Computer Architecture, 2009. Vancouver, Canada, June 2000. [31] T. Kawahara et al. 2 Mb SPRAM (spin-transfer torque RAM) [2] ALTERA. Stratix vs. Virtex-2 Pro FPGA performance with bit-by-bit bi-directional current write and analysis, 2004. parallelizing-direction current read. IEEE Journal of [3] B. Amrutur and M. Horowitz. Speed and power scaling of Solid-State Circuits, 43(1):109–120, January 2008. SRAMs. 2000. [32] T. Kishi and H. Yoda and T. Kai et al. Lower-current and fast [4] D. Burger, J. R. Goodman, and A. Kagi. Memory bandwidth switching of a perpendicular TMR for high speed and high limitations of future microprocessors. In International density spin-transfer-torque MRAM. In IEEE International Symposium on Computer Architecture, Philedelphia, PA, May Electron Devices Meeting, 2008. 1996. [33] U. K. Klostermann et al. A perpendicular spin torque [5] E. Catovic. GRFPU-high performance IEEE-754 floating-point switching based MRAM for the 28 nm technology node. In unit. http://www.gaisler.com/doc/grfpu_dasia.pdf. IEEE International Electron Devices Meeting, 2007. [6] C. Chappert, A. Fert, and F. N. V. Dau. The emergence of [34] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie. spin electronics in data storage. Nature Materials, 6:813–823, Hybrid cache architecture with disparate memory technologies. November 2007. In International Symposium on Computer Architecture, 2009. [7] M. D. Ciletti. Advanced Digital Design with the Verilog HDL. [35] Xilinx. Virtex-6 FPGA Family Overview, November 2009. 2004. http://www.xilinx.com/support/documentation/data_sheets/ [8] D. Suzuki et al. Fabrication of a nonvolatile lookup table ds150.pdf. circuit chip using magneto/semiconductor hybrid structure for [36] W. Xu, Y. Chen, X. Wang, and T. Zhang. Improving STT an immediate power up field programmable gate array. In MRAM storage density through smaller-than-worst-case Symposium on VLSI Circuits, 2009. transistor sizing. In Design Automation Conference, 2009. [9] R. Desikan, C. R. Lefurgy, S. W. Keckler, and D. C. Burger. [37] W. Xu, T. Zhang, and Y. Chen. Spin-transfer torque On-chip MRAM as a high-bandwidth, low-latency replacement magnetoresistive content addressable memory (CAM) cell for DRAM physical memories. In IBM Austin Center for structure design with enhanced search noise margin. In Advanced Studies Conference, 2003. International Symposium on Circuits and Systems, 2008. [10] X. Dong, X. Wu, G. Sun, H. Li, Y. Chen, and Y. Xie. Circuit [38] W. Zhao and Y. Cao. New generation of predictive technology and mircoarchitecture evaluation of 3D stacking magnetic model for sub-45nm design exploration. In International RAM (MRAM) as a universal memory replacement. In Design Symposium on Quality Electronic Design, 2006. Automation Conference, 2008. http://ptm.asu.edu/. [11] HiTech. DDR2 memory controller IP core for FPGA and ASIC. [39] W. Zhao, C. Chappert, and P. Mazoyer. Spin transfer torque http://www.hitechglobal.com/IPCores/DDR2Controller.htm. (STT) MRAM-based runtime reconfiguration FPGA circuit. In [12] Y. Huai. Spin-transfer torque MRAM (STT-MRAM) challenges ACM Transactions on Embedded Computing Systems, 2009. and prospects. AAPPS Bulletin, 18(6):33–40, December 2008. [40] P. Zhou, B. Zhao, J. Yang, and Y. Zhang. Energy reduction for [13] ITRS. International Technology Roadmap for STT-RAM using early write termination. In International Semiconductors: 2009 Executive Summary. Conference on Computer-Aided Design, 2009. http://www.itrs.net/Links/2009ITRS/Home2009.htm. [14] K. Tsuchida et al. A 64Mb MRAM with clamped-reference and adequate-reference schemes. In Proceedings of the IEEE International Solid-State Circuits Conference, 2010. [15] G. Kane. MIPS RISC Architecture. 1988. 382
US