Motivation
Banked Register File for • Increasing demand on RF
SMT Processors number of ports and
number of registers in 512x64b
a register file. 16R/8W
Jessica H. Tseng and Krste Asanović • Growing concerns in DC
access time, power,
MIT Computer Science and Artificial and die area. 64KB
Intelligence Laboratory,
– Example: Alpha 21464 Alpha 21464 Floorplan
Cambridge, MA 02139, USA register file (RF) occupied ISSCC, 2002
over 5X the area of 64KB
BARC2004 primary data cache (DC).
1 2
Distributed Architecture Centralized Architecture
• Duplicated RF RF • Multi-Level: RF
Register File Cache RF
– Fewer Read Ports ALU ALU
– Same Number of Write Ports Cluster 0 Cluster 1 – Fewer Read Ports ALU ALU
– Twice Total Number of Registers – Fewer Write Ports
– Alpha 21264 & Alpha 21464 – Control Logic Complexity RF RF
– Poor Locality
• Non-Duplicated RF RF ALU ALU
• One-Level Multi-Banked
– Fewer Read Ports ALU ALU
– Fewer Write Ports Cluster 0 Cluster 1 – Fewer Read Ports
– Fewer Write Ports
– Complex Inter-Cluster – Possible Conflicts
Communication – Control Logic Complexity
– Possible Pipeline Stalls
Inter-Cluster Communication
3 4
Previous Work Our Work
• Use minimal number of ports per register file • Use more ports per register file bank.
banks: 1 or 2-read port(s) and 1-write port. • Speculatively issue potentially conflicting
• Avoid issuing instructions that would cause instructions.
register file read conflicts.
– Minimize impact to the critical wakeup-select loop
– Add complexity to the critical wakeup-select loop for the issue logic
for the issue logic Æ slower cycle time
• Rapidly repair pipeline and reissue conflicting
• Resolve register file write conflicts by either instructions when conflicts are detected after
delaying physical register allocation until write issue.
back stage or installing write buffers.
– No write buffer requirement
– Complex pipeline control logic – No pipeline stalls
– Possible pipeline stalls
Simpler and Faster Control Logic
5
6
Banked Regfile Structure Regfile Characteristics
64x32b 8B1R1W 64x32b 8B2R2W • Area: Magic, 0.25µm TSMC CMOS process
• Delay & Energy: HSPICE, 2.5V supply voltage
Bank 0 Bank 0
Bank 1 Bank 1 64x32b, 8 Read Ports & 4 Write Ports
Type Baseline 8B8R4W 8B2R2W 8B2R1W 8B1R1W
Bank 2 Bank 2 Area 100% 123% 37% 32% 30%
Delay 100% 83% 75% 75% 77%
Bank 7 Bank 7 Energy 100% 61% 59% 58% 41%
8-Read 4-Write Packing
AALALUULAULU Bitline
8-Read 4-Write 8
AALAULAULULU
7
Pipeline Structure Pipeline Repair Operation
0 12 3 4 5 67 Issue Arbitrate Read
Conflict Bypass
Fetch Decode Rename Issue Arbitrate Read Execute Writeback Wakeup Select Detected
Bypass
Wakeup Issue
• Speculatively Issue Potentially Conflicting Dependents Kill Conflicting Instructions
Instructions: Same Wakeup-Select Loop Kill Following Issue Group
• Additional Arbitration Pipeline Stage Arbitrate Read
Bypass
– Detect read and write bank conflicts when too many Wakeup Select
instructions try to read from or write to the same register
file bank. Clear Ready Bits
– Mux operand addresses into available register file ports. Issue Arbitrate Read
– Adds a cycle to branch misprediction latency. Bypass
Wakeup Select
9 10
Performance Impact Reduce Bank Conflicts
• Additional Pipeline Stage • Avoid contending for register file read ports
when possible.
– Increase in branch misprediction latency.
– Bypass Skip: Operands that will be sourced from
• Possibility of Bank Conflicts the bypass network do not compete for access to the
register file.
– Add penalty cycles to repair the pipeline.
– Delay the issuing of dependent instructions. – Read Sharing: Allow multiple instructions to read
the same physical register from the same bank.
11
• Help remove the correlation between accesses
to the same bank.
12
Evaluating Performance Impact IPC Comparison
• IPC degradation simulation: modify • Average IPC degradation across SPEC CINT2000
SMTSIM simulator to keep track of a unified
physical register file organized into banks. 4
Baseline
• Register File Size: 512
• Fetch, Dispatch, Commit Width: 8 3 16B4R2W
• Integer ALUs: 8
• Memory Instructions: 4 (2-Load/2-Store) 2
• Benchmarks: Use SPEC CINT2000 1
13 0 2-thread 4-thread
1-thread
Workload 1-thread 2-thread 4-thread
Degradation
Degradation (%) 0.04 0.05 0.06
Range (%) 2.16 1.91 1.88
0.13~4.02 1.33~2.84 1.31~2.21
14
SMT Advantage Bank Conflict Probability (1)
• Multi-threading Environment 120% 8-Access
100% 7-Access
– Ability to hide the branch misprediction 6-Access
penalty. 80% 5-Access
60% 4-Access
• Large Number of Registers 3-Access
– More banks in the register file. 40%
– Dramatically reduce number of bank conflicts.
20%
15
0% 2-Bank 4-Bank 8-Bank 16-Bank 32-Bank
1-Bank
• Each bank has two local ports.
16
Bank Conflict Probability (2) Conclusion
100% 8-Access • Banked register files work better for SMT processor than
90% 7-Access the single thread superscalar processor due to SMT’s
80% 6-Access ability to hide branch misprediction penalty.
70% 5-Access
60% 4-Access • Banked register files scale well with increasing issue
50% 3-Access widths in term of die area, access latency, and power.
40% 2-Access
30% – Enough number of banks with enough number of ports.
20% – The correlation of same-cycle accesses can be removed.
10%
0% • For an eight-issue SMT processor with 512 physical
registers, by adopting a 16-banked register file design
1-Port 2-Port 3-Port 4-Port with four read ports and two write ports, we can reduce
regfile area by a factor of seven over a monolithic design
• Each design has sixteen banks. while decreasing IPC by less than 2%.
17 18
Thank You
19