The words you are searching are inside this book. To get more targeted content, please make full-text search by clicking here.

1 Banked Register File for SMT Processors Jessica H. Tseng and Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA

Discover the best professional documents and content resources in AnyFlip Document Base.
Search
Published by , 2017-05-15 06:10:03

16R/8W SMT Processors - Northeastern University

1 Banked Register File for SMT Processors Jessica H. Tseng and Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA

Motivation

Banked Register File for • Increasing demand on RF
SMT Processors number of ports and
number of registers in 512x64b
a register file. 16R/8W

Jessica H. Tseng and Krste Asanović • Growing concerns in DC
access time, power,
MIT Computer Science and Artificial and die area. 64KB
Intelligence Laboratory,
– Example: Alpha 21464 Alpha 21464 Floorplan
Cambridge, MA 02139, USA register file (RF) occupied ISSCC, 2002
over 5X the area of 64KB
BARC2004 primary data cache (DC).

1 2

Distributed Architecture Centralized Architecture

• Duplicated RF RF • Multi-Level: RF
Register File Cache RF
– Fewer Read Ports ALU ALU
– Same Number of Write Ports Cluster 0 Cluster 1 – Fewer Read Ports ALU ALU
– Twice Total Number of Registers – Fewer Write Ports
– Alpha 21264 & Alpha 21464 – Control Logic Complexity RF RF
– Poor Locality
• Non-Duplicated RF RF ALU ALU
• One-Level Multi-Banked
– Fewer Read Ports ALU ALU
– Fewer Write Ports Cluster 0 Cluster 1 – Fewer Read Ports
– Fewer Write Ports
– Complex Inter-Cluster – Possible Conflicts
Communication – Control Logic Complexity
– Possible Pipeline Stalls
Inter-Cluster Communication

3 4

Previous Work Our Work

• Use minimal number of ports per register file • Use more ports per register file bank.
banks: 1 or 2-read port(s) and 1-write port. • Speculatively issue potentially conflicting

• Avoid issuing instructions that would cause instructions.
register file read conflicts.
– Minimize impact to the critical wakeup-select loop
– Add complexity to the critical wakeup-select loop for the issue logic
for the issue logic Æ slower cycle time
• Rapidly repair pipeline and reissue conflicting
• Resolve register file write conflicts by either instructions when conflicts are detected after
delaying physical register allocation until write issue.
back stage or installing write buffers.
– No write buffer requirement
– Complex pipeline control logic – No pipeline stalls
– Possible pipeline stalls
Simpler and Faster Control Logic
5
6

Banked Regfile Structure Regfile Characteristics

64x32b 8B1R1W 64x32b 8B2R2W • Area: Magic, 0.25µm TSMC CMOS process
• Delay & Energy: HSPICE, 2.5V supply voltage
Bank 0 Bank 0

Bank 1 Bank 1 64x32b, 8 Read Ports & 4 Write Ports

Type Baseline 8B8R4W 8B2R2W 8B2R1W 8B1R1W

Bank 2 Bank 2 Area 100% 123% 37% 32% 30%

Delay 100% 83% 75% 75% 77%

Bank 7 Bank 7 Energy 100% 61% 59% 58% 41%

8-Read 4-Write Packing
AALALUULAULU Bitline

8-Read 4-Write 8
AALAULAULULU
7

Pipeline Structure Pipeline Repair Operation

0 12 3 4 5 67 Issue Arbitrate Read
Conflict Bypass
Fetch Decode Rename Issue Arbitrate Read Execute Writeback Wakeup Select Detected
Bypass
Wakeup Issue
• Speculatively Issue Potentially Conflicting Dependents Kill Conflicting Instructions
Instructions: Same Wakeup-Select Loop Kill Following Issue Group

• Additional Arbitration Pipeline Stage Arbitrate Read
Bypass
– Detect read and write bank conflicts when too many Wakeup Select
instructions try to read from or write to the same register
file bank. Clear Ready Bits

– Mux operand addresses into available register file ports. Issue Arbitrate Read
– Adds a cycle to branch misprediction latency. Bypass
Wakeup Select

9 10

Performance Impact Reduce Bank Conflicts

• Additional Pipeline Stage • Avoid contending for register file read ports
when possible.
– Increase in branch misprediction latency.
– Bypass Skip: Operands that will be sourced from
• Possibility of Bank Conflicts the bypass network do not compete for access to the
register file.
– Add penalty cycles to repair the pipeline.
– Delay the issuing of dependent instructions. – Read Sharing: Allow multiple instructions to read
the same physical register from the same bank.
11
• Help remove the correlation between accesses
to the same bank.

12

Evaluating Performance Impact IPC Comparison

• IPC degradation simulation: modify • Average IPC degradation across SPEC CINT2000
SMTSIM simulator to keep track of a unified
physical register file organized into banks. 4
Baseline
• Register File Size: 512
• Fetch, Dispatch, Commit Width: 8 3 16B4R2W
• Integer ALUs: 8
• Memory Instructions: 4 (2-Load/2-Store) 2

• Benchmarks: Use SPEC CINT2000 1

13 0 2-thread 4-thread
1-thread

Workload 1-thread 2-thread 4-thread
Degradation
Degradation (%) 0.04 0.05 0.06
Range (%) 2.16 1.91 1.88
0.13~4.02 1.33~2.84 1.31~2.21

14

SMT Advantage Bank Conflict Probability (1)

• Multi-threading Environment 120% 8-Access
100% 7-Access
– Ability to hide the branch misprediction 6-Access
penalty. 80% 5-Access
60% 4-Access
• Large Number of Registers 3-Access

– More banks in the register file. 40%
– Dramatically reduce number of bank conflicts.
20%
15
0% 2-Bank 4-Bank 8-Bank 16-Bank 32-Bank
1-Bank

• Each bank has two local ports.

16

Bank Conflict Probability (2) Conclusion

100% 8-Access • Banked register files work better for SMT processor than
90% 7-Access the single thread superscalar processor due to SMT’s
80% 6-Access ability to hide branch misprediction penalty.
70% 5-Access
60% 4-Access • Banked register files scale well with increasing issue
50% 3-Access widths in term of die area, access latency, and power.
40% 2-Access
30% – Enough number of banks with enough number of ports.
20% – The correlation of same-cycle accesses can be removed.
10%
0% • For an eight-issue SMT processor with 512 physical
registers, by adopting a 16-banked register file design
1-Port 2-Port 3-Port 4-Port with four read ports and two write ports, we can reduce
regfile area by a factor of seven over a monolithic design
• Each design has sixteen banks. while decreasing IPC by less than 2%.

17 18

Thank You

19


Click to View FlipBook Version