Preface |
Beyond Pipelining, CISC, and RISC / Chapter 1: |
An Introduction to Superscalar Concepts / Chapter 2: |
Fundamental Limitations / 2.1: |
True Data Dependencies / 2.1.1: |
Procedural Dependencies / 2.1.2: |
Resource Conflicts / 2.1.3: |
Instruction Parallelism and Machine Parallelism / 2.1.4: |
Instruction Issue and Machine Parallelism / 2.2: |
In-Order Issue with In-Order Completion / 2.2.1: |
In-Order Issue with Out-of-Order Completion / 2.2.2: |
Out-of-Order Issue with Out-of-Order Completion / 2.2.3: |
Storage Conflicts and Register Renaming / 2.2.4: |
Related Concepts: VLIW and Superpipelined Processors / 2.3: |
Very-Long-Instruction-Word Processors / 2.3.1: |
Superpipelined Processors / 2.3.2: |
Hybrid Techniques / 2.3.3: |
Unrelated Parallel Schemes / 2.4: |
Developing an Execution Model / Chapter 3: |
Simulation Technique / 3.1: |
Benchmarking Performance / 3.2: |
Basic Observations on Hardware Design / 3.3: |
The Philosophy of the Standard Processor / 3.3.1: |
Instruction Parallelism of the Benchmarks / 3.3.2: |
Machine Parallelism / 3.3.3: |
The Design of the Standard Processor / 3.4: |
Basic Organization / 3.4.1: |
Out-of-Order Issue / 3.4.2: |
Register Renaming / 3.4.3: |
Loads and Stores / 3.4.4: |
The Performance of the Model / 3.4.5: |
The Real Performance Limit: Procedural Dependencies / 3.5: |
Background / 3.6: |
Instruction Fetching and Decoding / Chapter 4: |
Branches and Instruction-Fetch Inefficiencies / 4.1: |
Improving Fetch Efficiency / 4.2: |
Scheduling Delayed Branches / 4.2.1: |
Branch Prediction / 4.2.2: |
Aligning and Merging / 4.2.3: |
Simulation Results and Observations / 4.2.4: |
Multiple-Path Execution / 4.2.5: |
Implementing Hardware Branch-Prediction / 4.3: |
Setting and Interpreting Cache Entries / 4.3.1: |
Predicting Branches / 4.3.3: |
Hardware and Performance Costs / 4.3.4: |
Implementing a Four-Instruction Decoder / 4.4: |
Implementing Branches / 4.5: |
Number of Pending Branches / 4.5.1: |
Order of Branch Execution / 4.5.2: |
Simplifying Branch Decoding / 4.5.3: |
Reducing the Penalty of Procedural Dependencies: Observations / 4.6: |
The Role of Exception Recovery / Chapter 5: |
Buffering State Information for Restart / 5.1: |
In-Order, Lookahead, and Architectural State / 5.1.1: |
Checkpoint Repair / 5.1.2: |
History Buffer / 5.1.3: |
Reorder Buffer / 5.1.4: |
Future File / 5.1.5: |
Restart Implementation and Effect on Performance / 5.2: |
Mispredicted Branches / 5.2.1: |
Exceptions / 5.2.2: |
The Effect of Recovery Hardware on Performance / 5.2.3: |
Processor Restart: Observations / 5.3: |
Register Dataflow / Chapter 6: |
Dependency Mechanisms / 6.1: |
The Value of Register Renaming / 6.1.1: |
Register Renaming with a Reorder Buffer / 6.1.2: |
Renaming with a Future File: Tomasulo's Algorithm / 6.1.3: |
Enforcing Dependencies with Interlocks / 6.1.4: |
Copying Operands to Avoid Antidependencies / 6.1.5: |
Partial Renaming / 6.1.6: |
Special Registers and Instruction Side Effects / 6.1.7: |
Result Buses and Arbitration / 6.2: |
Result Forwarding / 6.3: |
Supplying Instruction Operands: Observations / 6.4: |
Reservation Stations / Chapter 7: |
Reservation Station Operation / 7.1.1: |
Performance Effect of Reservation-Station Size / 7.1.2: |
A Simpler Implementation of Reservation Stations / 7.1.3: |
Implementing a Central Instruction Window / 7.2: |
The Dispatch Stack / 7.2.1: |
The Register Update Unit / 7.2.2: |
Using a Reorder Buffer to Simplify the Central Window / 7.2.3: |
Operand Buses from a Central Window / 7.2.4: |
The Complexity of a Central Window / 7.2.5: |
Out-of-Order Issue: Observations / 7.3: |
Memory Dataflow / Chapter 8: |
Ordering of Loads and Stores / 8.1: |
Total Ordering of Loads and Stores / 8.1.1: |
Load Bypassing of Stores / 8.1.2: |
Load Bypassing with Forwarding / 8.1.3: |
Performance of the Load/Store Policy / 8.1.4: |
Load Side Effects / 8.1.5: |
Addressing and Dependencies / 8.2: |
Limiting Address Logic with a Preaddress Buffer or Central Instruction Window / 8.2.1: |
Effect of Store-Buffer Size / 8.2.2: |
Memory Dependency Checking / 8.2.3: |
What is More Load/Store Parallelism Worth? / 8.3: |
Esoterica: Multiprocessing Considerations / 8.4: |
Accessing External Data: Observations / 8.5: |
Complexity and Controversy / Chapter 9: |
A Brief Glimpse at Design Complexity / 9.1: |
Allocating Processor Resources / 9.1.1: |
Instruction Decode / 9.1.2: |
Instruction Completion / 9.1.3: |
The Painful Truth / 9.1.4: |
Major Hardware Features / 9.2: |
Hardware Simplifications / 9.3: |
Is the Complexity Worth It? / 9.4: |
Basic Software Scheduling / Chapter 10: |
The Benefit of Scheduling / 10.1: |
Impediments to Efficient Execution / 10.1.1: |
How Scheduling Can Help / 10.1.2: |
Is the Benefit Significant? / 10.1.3: |
Program Information Needed for Scheduling / 10.2: |
Dividing Code into Basic Blocks / 10.2.1: |
The Dataflow Graph of a Basic Block / 10.2.2: |
The Precedence Graph / 10.2.3: |
The Concept of the Critical Path / 10.2.4: |
The Resource Reservation Table / 10.2.5: |
Relationship of the Scheduler and the Compiler / 10.3: |
Interaction of Register Allocation and Scheduling / 10.3.1: |
Scheduling During Compilation Versus After Compilation / 10.3.2: |
Algorithms for Scheduling Basic Blocks / 10.4: |
The Expense of an Optimum Schedule / 10.4.1: |
List Scheduling / 10.4.2: |
The Effect of Scheduling Order / 10.4.3: |
Other Scheduling Alternatives / 10.4.4: |
Revisiting the Hardware / 10.5: |
Software Scheduling Across Branches / Chapter 11: |
Trace Scheduling / 11.1: |
A Simple Example of Trace Scheduling / 11.1.1: |
Using Compensation Code to Recover from Incorrect Predictions / 11.1.2: |
Trace Scheduling an Entire Program / 11.1.3: |
Correctness of Trace Scheduling / 11.1.4: |
Loop Unrolling / 11.2: |
Unrolling to Improve the Loop Schedule / 11.2.1: |
Unrolling with Data-Dependent Branches / 11.2.2: |
Software Pipelining / 11.3: |
Pipelining Operations from Different Loop Iterations / 11.3.1: |
Software-Pipelining Techniques / 11.3.2: |
Filling and Flushing the Pipeline: The Prologue and Epilogue / 11.3.3: |
Register Renaming in the Software-Pipelined Loop / 11.3.4: |
Global Code Motion / 11.4: |
Out-of-Order Issue and Scheduling Across Branches / 11.5: |
Evaluating Alternatives: A Perspective on Superscalar Microprocessors / Chapter 12: |
The Case for Software Solutions / 12.1: |
Instruction Formats to Simplify Hardware / 12.1.1: |
Instruction Formats for Scheduling Across Branches / 12.1.2: |
The Costs and Risks of Software Solutions / 12.1.3: |
The Case for Hardware Solutions / 12.2: |
Two Models of Performance Growth / 12.2.1: |
Estimating Risks in a Performance-Oriented Design / 12.2.2: |
Estimating Risks in a Cost-Sensitive Design / 12.2.3: |
Putting Risks in Perspective / 12.2.4: |
A Superscalar 386 / Appendix: |
The Architecture / A.1: |
Instruction Format / A.1.1: |
Register Dependencies / A.1.2: |
Memory Accesses / A.1.3: |
Complex Instructions / A.1.4: |
The Implementation / A.2: |
Out-of-Order Microinstruction Issue / A.2.1: |
Overlapping Microinstruction Sequences / A.2.2: |
Superscalar Execution of a "RISC Core" Instruction Set / A.2.3: |
Conclusion / A.3: |
References |
Index |