

## System z: A Peek Under the Hood

Tim Slegel, IBM Distinguished Engineer, System z Processor Development

October 19, 2010

© 2010 IBM Corporation



## **Topics**

- Review of recent System z mainframes
- Processor hardware overview
- Millicode and Virtualization
- Cache/memory subsystem
- Performance
- New Instruction Set Architecture for z196
- zBX: A system of systems
- Energy efficiency

### IBM zEnterprise 196 Continues the CMOS Mainframe Heritage



### System Hardware, Firmware, and Software





### Quad Core zEnterprise 196 Processor Chip



- 45nm SOI Technology
  - 13 layers of metal
  - 3.5 km wire

- Chip Area 512.3mm<sup>2</sup>
  - 23.5mm x 21.8mm
  - 8093 Power C4's
  - 1134 signal C4's

- Up to four active cores per chip
  - 5.2 GHz system operation; fastest processor in the world
  - L1 cache per core
    - 64 KB I-cache
    - 128 KB D-cache
  - 1.5 MB private L2 cache per core
- Two Co-processors (COP)
  - Crypto & compression accelerators
  - Includes 16KB cache
  - Shared by two cores
- 24MB eDRAM L3 Cache
  - Shared by all four cores
- Interface to SC chip / L4 cache
  - 40+ GB/sec to each of 2 SCs
- I/O Bus Controller (GX)
  - Interface to Host Channel Adapter (HCA)
- Memory Controller (MC)
  - Interface to controller on memory DIMMs
  - Supports RAIM design

# z196 Microprocessor Core



# z196 Out-of-Order (OOO) Operation

#### Design:

- Instruction decoding and completion in architected program order
- Operand address generation, operand access, and instruction execution can occur out of order
- Special circuitry makes all out-of-order operation invisible to software

#### Performance value

- Reordering instruction execution around operand dependencies
- Reordering storage accesses around address dependencies
- Hiding storage access latency
- Allowing full utilization of varying-depth pipelines




## Basic z196 OOO Terminology

#### IFU (in-order)

- Fetches instructions
- Forms and sends clumps of 1 to 3 instructions to IDU

#### IDU (in-order)

- Decodes z/Architecture instructions
- Cracks or expands complex instructions into multiple μops
  - Cracking results in 2 or 3 μops (one group)
  - Expansion results in >3 μops (multiple groups)
- A μop is the fundamental unit of work: it issues to one execution unit in one execution slot (generally) and updates at most one register or doubleword (DW) in storage
- Creates a dispatch group of 1 to 6 μops
- Dispatches up to 1 group to ISU per cycle

#### ISU

- Issues μops to execution units (LSU, FXU, BFU, DFU)
- Rescinds instructions which need to be re-executed
  - E.g. an instruction dependent on a load which misses the D-cache
- Flushes groups of instructions which need to be discarded or re-executed
  - E.g. branch mispredictions

#### Execution units

- Reject μops which cannot be executed (e.g. D-cache miss, TLB miss, ...)
- Create finish reports after μops have executed

#### Completion (ISU)

- Completes up to 1 group per cycle in order



# z196 Microprocessor Pipeline





## z196 CPU core

- Each core is a superscalar, out of order processor with these characteristics:
  - The clock frequency is 5.2 GHz (a cycle time of about 0.19 ns)
  - Six RISC-like execution units
    - 2 fixed point (integer), 2 load/store, 1 binary floating point, 1 decimal floating point
  - Up to three instructions decoded per cycle (vs. 2 in z10)
  - Up to five instructions/operations executed per cycle (vs. 2 in z10)
    - Execution can occur out of (program) order
    - Memory address generation and memory accesses can occur out of (program) order
    - Special circuitry to make execution and memory accesses appear in order to software
  - Each core has 3 private caches
    - 64KB 1<sup>st</sup> level cache for instructions, 128KB 1<sup>st</sup> level cache for data
    - 1.5MB L2 cache containing both instructions and data
- The same physical processor can be used for all of the following CPU types:
  - Normal client CPUs
  - Specialty Engines: zIIPs, zAAPs, IFLs
  - Coupling Facilities
  - SAPs: I/O and service processors
  - Spare CPUs used for Dynamic Processor Sparing in the event of a failing processor



# Extensive use of hardware speculation

- z/Architecture places many strict constraints on how the CPU has to appear to behave
  - Example: strict storage ordering rules (see POPS chapter 5)
  - Good for software: significantly easier and more robust MP programming than with other ISAs
  - Bad for the CPU design team: difficult to achieve good performance

#### CPU has to make use of speculative processing techniques

- Assume things will go well, and have mechanisms to detect and back off if they do not
- In CPU design, "It's OK to cheat as long as you don't get caught."
- Under the covers, the CPU violates the storage ordering rules in POPS, but has extensive/complex logic to detect if software might observe it violating those rules. If it detects possible observation, it needs to redo the operation precisely following POPS rules.
- The result is that software can only ever observe the CPU following all the rules
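
The detect-and-back-off pattern described above can be sketched as a toy model. This is only an illustration, not the z196 mechanism: the function name `run_speculatively`, the dictionary standing in for storage, and the use of a value comparison to detect interference are all assumptions made for the example (real hardware would presumably track the accessed storage itself rather than compare values).

```python
def run_speculatively(memory, address, compute, interfering_writes=()):
    """Toy model: fetch early, do the work, then validate before committing.

    `interfering_writes` is a hypothetical list of (address, value) stores
    made by "another CPU" while the speculative work is in flight.
    Returns (result, number_of_attempts).
    """
    pending = list(interfering_writes)
    attempts = 0
    while True:
        attempts += 1
        observed = memory[address]        # early (speculative) fetch
        result = compute(observed)        # work done on the speculative value
        if pending:                       # another CPU stores in the meantime
            other_address, value = pending.pop(0)
            memory[other_address] = value
        if memory[address] == observed:   # nobody could have seen a violation
            return result, attempts
        # Otherwise the speculative result is discarded and the work is redone.


memory = {0x100: 1}
print(run_speculatively(memory, 0x100, lambda v: v + 1))
print(run_speculatively(memory, 0x100, lambda v: v + 1, [(0x100, 5)]))
```

The second call needs two attempts because the simulated store changes the location after it was fetched; the caller only ever sees a result computed from the final value.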

## Instruction Set Architecture (ISA)



- Most complex instructions are executed by millicode
  - Another 24 instructions are conditionally executed by millicode
- Medium complexity instructions cracked at decode into 2 or more µops
- Most RX instructions cracked at issue → dual issued
  - RX have one storage operand and one register operand
- Some storage-storage ops executed by LSU sequencer
- Remaining z instructions are RISC-like and map to single µop

### Millicode

- Our name for the vertical microcode that executes on the processor
- Runs in a special mode on the normal processor pipeline; there is no specialized microcode engine
- Written in assembler (with optimizers and semantic correctness tools)
- Most z/Architecture instructions are available for use in millicode routines
- Resides in HSA and is cached in the I-cache. Storage operands can be in the D-cache.
- Separate set of millicode General Purpose Registers
- Special millicode assist instructions
  - Move data to/from micro-architected control registers and facilities
  - Performance enhancing instructions (over the years, some of these have been transferred to POPS and are usable by normal software)
  - Pipeline controls
  - Ability to move data anywhere in storage between LPAR partitions or to/from HSA
  - CoP access for crypto and compression
  - Perform System Operations (page mover engine, multi-CPU operations such as broadcast TLB purges, I/O operations, service functions, etc.)
- Interestingly, for some millicode instructions full pipeline interlocks are not maintained in hardware
  - E.g., read after write of a special register may not yield the updated value
  - Improves performance and simplifies hardware complexity but makes it more difficult to write millicode



## Hardware/Millicode Support for Virtualization

#### Full <u>logical</u> virtualization via the START INTERPRETIVE EXECUTION (SIE) instruction

#### Nested SIE supports two level guests:

- LPAR Hypervisor (firmware) runs natively. First level guests are normal OSes (e.g., z/OS, zLinux, z/VM). Up to 60 1<sup>st</sup> level partitions.
- If z/VM is running as a first level guest, then it supports hundreds (or thousands) of second level guests (e.g., zLinux)

#### Separate hardware Host/Guest-1/Guest-2 facilities:

- z/Architecture control registers
- Timing Facility (including interrupt controls)
- All important SIE State Description controls are buffered into hardware control registers during SIE-entry/exit, which is performed by millicode
- Hardware detects most SIE Intercept and Intervention conditions

#### Full hardware support for SIE address translation:

- RRF supports zone relocation (and zone based I/O interrupts)
- Multi-level pageable guest support (up to 56 table fetches required for a single 2<sup>nd</sup> level guest ART/DAT translation)
- MCDS handling of ARs
- TLB2 holds multiple SIE guest entries simultaneously
- Appropriate TLB purging on all CPUs for IPTE/IDTE operations with filtering

# **Timing Facility**

- Master Time-of-Day (TOD) kept on one SC chip in the system
- All processors have their own local copy of the TOD
  - Provides faster access to the TOD for STCK, STCKF, TRACE, etc.
  - Full TOD, CPU Timer, Clock Comparator
  - Logic to provide system-wide unique, monotonically increasing values (as required by POPS)
- Synchronization pulse provides check and precise timebase
- Server Time Protocol (STP) provides for synchronization between multiple systems
  - Hardware provides interrupt to millicode when TOD steering is required
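
The uniqueness and monotonicity requirement on clock values can be illustrated with a small sketch. The slides do not describe how the hardware achieves this, so the class below is purely a software analogy: `MonotonicUniqueClock`, `store_clock`, and the "bump by one" rule are hypothetical names and choices made for the example.

```python
class MonotonicUniqueClock:
    """Toy clock: every value returned is unique and strictly increasing."""

    def __init__(self, raw_source):
        self._raw_source = raw_source   # callable returning a raw tick count
        self._last = -1                 # last value handed out

    def store_clock(self):
        value = self._raw_source()
        if value <= self._last:         # raw source repeated or stepped back
            value = self._last + 1      # force the next larger value
        self._last = value
        return value


raw_ticks = iter([10, 10, 9, 20])
clock = MonotonicUniqueClock(lambda: next(raw_ticks))
print([clock.store_clock() for _ in range(4)])
```

Even though the raw source repeats a value and then steps backwards, the caller never sees a duplicate or a decrease.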

# z196 Compression and Cryptography Accelerator

- Data compression engine
  - Static dictionary compression and expansion
  - Dictionary size up to 64KB (8K entries)
    - Local 16KB cache per core for dictionary data
- CP Assist for Cryptographic Function (CPACF)
  - Enhancements for new NIST standard
  - Complemented prior ECB and CBC symmetric cipher modes with XTS, OFB, CTR, CFB, CMAC and CCM
  - New primitives (128b Galois Field multiply) for GCM
- Accelerator unit shared by 2 cores
  - Independent compression engines
  - Shared cryptography engines
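
As a quick check of the dictionary figures above, the arithmetic can be written out. The variable names are made up for the example, and reading "16KB cache" as holding whole dictionary entries is an assumption.

```python
KIB = 1024

dictionary_bytes = 64 * KIB   # maximum dictionary size from the slide
dictionary_entries = 8 * KIB  # "8K entries"
cache_bytes = 16 * KIB        # local cache for dictionary data

bytes_per_entry = dictionary_bytes // dictionary_entries
cached_entries = cache_bytes // bytes_per_entry
cached_fraction = cache_bytes / dictionary_bytes

print(bytes_per_entry, cached_entries, cached_fraction)
```

Under these assumptions each entry is 8 bytes, and the cache can hold a quarter of a maximum-size dictionary at a time.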






### Hub / Shared Cache Chip for z196



- eDRAM Shared L4 Cache
  - 96 MB per SC chip
  - 192 MB per Node
- 6 CP chip interfaces
  - 40+ GB/sec each
- 3 Fabric interfaces
  - 40+ GB/sec each
- 45nm SOI Technology
  - 13 layers of metal
- Chip Area 478.8mm<sup>2</sup>
  - 24.4mm x 19.6mm
  - 7100 Power C4's
  - 1819 signal C4's

#### 1.5 Billion Transistors

- 1 Billion cells for eDRAM

# z196 MCM / Book

- 96mm x 96mm MCM
  - 103 Glass Ceramic wiring layers
  - 8 chip sites
  - 7356 LGA connections
  - Up to 24 active cores
  - Up to 1800W power dissipation





#### 4 book System

- Fully connected topology
- 96 Total CPUs
- 12 Memory Controllers
- Up to 32 IO Hub ports
- Up to 3TB Memory capacity
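
The per-chip, per-MCM, and system figures quoted on these slides can be cross-checked against each other. This sketch assumes one MCM per book (node) and that every chip site holds either a CP chip or an SC chip; the constant names are invented for the example.

```python
# Figures quoted on the slides.
CORES_PER_CP_CHIP = 4
ACTIVE_CORES_PER_MCM = 24
L3_MB_PER_CP_CHIP = 24
L4_MB_PER_SC_CHIP = 96
L4_MB_PER_NODE = 192
CHIP_SITES_PER_MCM = 8
MCUS_PER_MCM = 3
BOOKS = 4

# Derived values (assuming one MCM per book).
cp_chips_per_mcm = ACTIVE_CORES_PER_MCM // CORES_PER_CP_CHIP
sc_chips_per_mcm = L4_MB_PER_NODE // L4_MB_PER_SC_CHIP
total_cpus = BOOKS * ACTIVE_CORES_PER_MCM
total_memory_controllers = BOOKS * MCUS_PER_MCM
l3_mb_per_book = cp_chips_per_mcm * L3_MB_PER_CP_CHIP
total_l4_mb = BOOKS * L4_MB_PER_NODE

print(cp_chips_per_mcm, sc_chips_per_mcm, total_cpus,
      total_memory_controllers, l3_mb_per_book, total_l4_mb)
```

The derived six CP chips and two SC chips account for the eight chip sites, and the totals match the 96 CPUs and 12 memory controllers listed for the four-book system.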






# z196 RAIM Memory Structure

#### Redundant Array of Independent Memory

- 5 channel memory controller
- DIMM bus CRC error retry
- Industry leading reliability

- Up to 3TB memory capacity
- 3 MCUs per MCM
- 2-deep DIMM cascade
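
The idea of a redundant array spread over five channels can be illustrated with the simplest scheme of that kind: four data channels plus one XOR parity channel, as in RAID. The slides do not describe the actual RAIM encoding, so this is an analogy under that assumption, and `write_stripe` and `read_stripe` are hypothetical names.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data_blocks):
    """Spread four data blocks over five channels; the fifth holds parity."""
    if len(data_blocks) != 4:
        raise ValueError("expected exactly four data blocks")
    return list(data_blocks) + [xor_blocks(data_blocks)]


def read_stripe(channels, failed_channel=None):
    """Return the four data blocks, rebuilding one failed channel if needed."""
    channels = list(channels)
    if failed_channel is not None:
        survivors = [c for i, c in enumerate(channels) if i != failed_channel]
        channels[failed_channel] = xor_blocks(survivors)
    return channels[:4]


data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb", b"\x0f\xf0"]
stripe = write_stripe(data)
stripe[2] = b"\x00\x00"   # simulate losing channel 2
print(read_stripe(stripe, failed_channel=2) == data)
```

Any single channel, data or parity, can be lost and the data is still recoverable from the remaining four.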





# z196 Book Layout





## IBM z196 System





# Reliability/Availability Features

- Near 100% hardware error detection for logic faults, far higher than other platforms
- Multi-level error recovery capability:
  - On-the-fly error correction of array errors. Automatically deletes failing sections of arrays for solid errors.
  - Within the processor, all instructions are checkpointed in fault-hardened registers/arrays. If a hardware error is detected, processor retry allows for the re-execution of the failed instruction. Effective for soft-errors.
  - In the event of a hard-error where retry is unsuccessful, Dynamic Processor Sparing moves the entire micro-architected state to a spare processor. Happens transparently to software and even the OS.
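
The retry-then-spare sequence above can be sketched as follows. This is a toy model under stated assumptions: a dictionary stands in for the checkpointed state, a CPU is any callable that may raise the hypothetical `DetectedError`, and the names `execute_with_recovery` and `make_flaky_cpu` are invented for the example.

```python
class DetectedError(Exception):
    """Raised by the toy CPU model when its checking logic sees a fault."""


def make_flaky_cpu(failures):
    """Build a toy CPU that fails `failures` times before working."""
    remaining = [failures]

    def cpu(state, instruction):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise DetectedError()
        instruction(state)

    return cpu


def execute_with_recovery(checkpoint, instruction, cpus, retries=1):
    """Run one instruction from a checkpoint; retry, then move to a spare.

    Returns (new_state, index_of_cpu_that_completed).
    """
    for cpu_index, cpu in enumerate(cpus):   # primary first, then spares
        for _ in range(retries + 1):
            working = dict(checkpoint)       # restart from the checkpoint
            try:
                cpu(working, instruction)
            except DetectedError:
                continue                     # discard partial work and retry
            return working, cpu_index        # becomes the next checkpoint
    raise RuntimeError("no working processor left")


def increment_r1(state):
    state["r1"] += 1


print(execute_with_recovery({"r1": 1}, increment_r1, [make_flaky_cpu(1)]))
print(execute_with_recovery({"r1": 1}, increment_r1,
                            [make_flaky_cpu(5), make_flaky_cpu(0)]))
```

The first call models a soft error that the retry clears; the second models a hard error, where the instruction completes on the spare and the program sees the same result either way.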


### IBM compilers exploit System z for maximum performance

- Compilers exploit new hardware instructions introduced by System z
- Code generated by the compilers is highly tuned for System z
- Boost in performance of applications running on System z

- z/OS XL C/C++
- Enterprise COBOL for z/OS
- Enterprise PL/I for z/OS
- 135 new / changed instructions



# Java and WAS performance with zEnterprise

World class per-thread performance yields outstanding results:

- CPU benchmark: 63%
- ILOG/CONfirm: 45-62%
- Multi-threaded: 45%
- WebSphere V7: up to 93%

Extensive hardware and software collaboration with deep platform exploitation:

- New out of order pipeline design
- New instructions optimized for software usage
- Java runtime environment general optimizations

|                 | System z10 announce (Then)   | Uplevel software                     | zEnterprise hardware (Now)           |
|-----------------|------------------------------|--------------------------------------|--------------------------------------|
| WebSphere level | WebSphere Version 7 Announce | WebSphere Version 7 JPA Feature Pack | WebSphere Version 7 JPA Feature Pack |
| Workload        | DayTrader 2.0                | DayTrader 2.0                        | DayTrader 2.0                        |
| Caching         | No Caching                   | Data Caching                         | Data Caching                         |
| Hardware        | System z10                   | System z10                           | zEnterprise                          |

# z196 New Instruction Set Architecture

#### High word extension

- General register high word independently addressable
- Gives software 32 word-sized registers
- Add/subtracts, compares, rotates, loads/stores
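
A small model shows how treating the high word as independently addressable yields 32 word-sized registers from 16 64-bit general registers. Python integers stand in for registers, and the helper names are made up for the example; they are not instruction mnemonics.

```python
MASK32 = 0xFFFFFFFF


def get_high(gr):
    """Bits 0-31 (the high word) of a 64-bit register value."""
    return (gr >> 32) & MASK32


def get_low(gr):
    """Bits 32-63 (the low word) of a 64-bit register value."""
    return gr & MASK32


def set_high(gr, word):
    """Replace the high word and leave the low word untouched."""
    return ((word & MASK32) << 32) | (gr & MASK32)


def add_high(gr, value):
    """32-bit add on the high word only; wraps without touching the low word."""
    return set_high(gr, get_high(gr) + value)


gr = 0x0000000100000002
print(hex(add_high(gr, 5)))
print(16 * 2)   # 16 general registers, two addressable words each
```

The add on the high word leaves the low word unchanged, which is what lets software keep two unrelated 32-bit values in one register.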

#### New atomic ops

- Load and "arithmetic" (ADD, AND, XOR, OR)
  - (Old) storage location value loaded into GR
  - Arithmetic result overwrites value at storage location
- Load Pair Disjoint
  - Load from two different storage locations into GR N, N+1
  - Condition code indicates whether fetches were atomic
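
The semantics of the load-and-"arithmetic" operations above (return the old value, store the result, all as one indivisible step) can be modeled in software. This is a sketch, not the hardware mechanism: the `Storage` class and `load_and` method are hypothetical, and a lock stands in for whatever makes the update atomic in hardware.

```python
import operator
import threading


class Storage:
    """Toy storage whose read-modify-write updates are indivisible."""

    def __init__(self):
        self._words = {}
        self._lock = threading.Lock()

    def load_and(self, address, operand, op):
        """Return the old value and store op(old, operand) atomically."""
        with self._lock:
            old = self._words.get(address, 0)
            self._words[address] = op(old, operand)
            return old

    def load(self, address):
        with self._lock:
            return self._words.get(address, 0)


storage = Storage()


def worker():
    for _ in range(1000):
        storage.load_and(0x100, 1, operator.add)   # "load and add"


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(storage.load(0x100))
```

Because each update is indivisible, no increments are lost even with eight threads updating the same location, and no compare-and-swap retry loop is needed in the caller.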

#### Conditional load, store, register copy

- Based on condition code
- Used to eliminate unpredictable branches

Old code:

- compare
- branch
- load
- instruction X

New code:

- compare
- conditional load
- instruction X
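
The difference between the two code shapes can be sketched with a pair of functions that compute the same result. This only models the semantics: the function names and the mask-based selection are assumptions for the example, and Python itself gains nothing from avoiding the branch.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def load_with_branch(condition, register, storage_value):
    """Old shape: a branch decides whether the load happens."""
    if condition:
        register = storage_value
    return register


def load_on_condition(condition, register, storage_value):
    """New shape: no branch; the condition selects the result via a mask."""
    mask = -int(bool(condition)) & MASK64   # all ones if true, zero if false
    return (storage_value & mask) | (register & ~mask & MASK64)


for condition in (True, False):
    print(load_with_branch(condition, 1, 2), load_on_condition(condition, 1, 2))
```

With no branch there is nothing for the branch predictor to mispredict, which matters most when the condition is data dependent and hard to predict.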



### z196 New Instruction Set Architecture

#### Distinct-Operands Facility (22 new instructions)

- Independent specification of result register (different than either source register)
- Reduces register value copying

#### Population-Count Facility (1 new instruction)

- Hardware implementation of bit counting, ~5x faster than prior software implementations

#### Integer to/from Floating point converts (39 new instructions)

#### New truncate and OR inexactness Binary Floating Point rounding mode

#### New Decimal Floating Point quantum exception

- Eliminates need for test data group for every operation
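
For context on the population-count item above, this is the kind of software bit-counting routine that a hardware instruction replaces. It is a well-known shift-and-mask technique, shown here as an assumption about what "prior software implementations" looked like; the exact result format of the new instruction is not described in the slides.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def popcount64_software(x):
    """Count the one bits in a 64-bit value using shifts, masks, and adds."""
    x &= MASK64
    x = x - ((x >> 1) & 0x5555555555555555)                          # 2-bit sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)   # 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F                          # 8-bit sums
    return ((x * 0x0101010101010101) & MASK64) >> 56                 # add the bytes


for value in (0, 0xFF, 0x8000000000000001, MASK64):
    print(hex(value), popcount64_software(value))
```

Even this branch-free version takes roughly a dozen dependent operations per word, which gives a sense of why a single hardware instruction is faster.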



## z196 New Instruction Set Architecture

#### Virtual Architecture Level

- Allows the z/VM Live Guest Relocation Facility to make a z196 behave architecturally like a z10 system
- Facilitates moving work transparently between z196 and z10 systems for backup and capacity reasons

#### Non-quiescing SSKE:

- Significant performance improvement for systems with a large number of CPUs (typically 30+)
- Improves MP ratio for larger systems
- Up to 10% performance increase when exploited by the operating system
- Exploited by z/OS 1.10 and above (with PTF for 1.10 and 1.11)
- Will be exploited by Linux and z/VM

#### PER Zero Address Detect

- Improved debug capability to detect uninitialized pointers

#### Other minor architecture features

- RRBM, Fast-BCR-Serialization Facility, Fetch-Store-Access Exception Indicator, CMPSC Enhancement Facility



## Announcing the IBM zEnterprise System: *A New Dimension in Computing*



- A "System of Systems", integrating IBM's leading technologies to dramatically improve productivity of today's multi-architecture data centers and tomorrow's private clouds.
- The world's fastest and most scalable enterprise system with unrivalled reliability, security, and manageability.
- The industry's most efficient platform for large scale data center simplification and consolidation.

# IBM zEnterprise System – Best-in-class systems and software technologies

A "System of Systems" that unifies IT for predictable service delivery



#### IBM zEnterprise 196 (z196)

- Optimized to host large-scale database, transaction, and mission-critical applications
- The most efficient platform for large-scale Linux consolidation
- Capable of massive scale-up
- New easy-to-use z/OS V1.12

\* All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

#### zEnterprise Unified Resource Manager

- Unifies management of resources, extending IBM System z qualities of service end-to-end across workloads
- Provides platform, hardware and workload management

#### zEnterprise BladeCenter Extension (zBX)

- Select IBM POWER7<sup>®</sup> and IBM x86\* blades for tens of thousands of AIX and Linux applications
- High-performance optimizers and appliances to accelerate time to insight and reduce cost
- Dedicated high-performance private network


### zBX – Infrastructure to support more resources

#### zBX houses the multiplatform solutions key to the zEnterprise System.

- Optimizers that are dedicated to workloads.
  - IBM Smart Analytics Optimizer and WebSphere DataPower appliance<sup>1</sup>
  - Closed environments with hardware and software included in solution
  - Individualized tools for sizing and customizing, dependent on the optimizer
- Select IBM POWER7 and IBM x86<sup>1</sup> blades running any application supported by the operating system installed on the blade – with no change.
- Mix and match Optimizer and select general purpose POWER7 and IBM x86 blades in the same rack.
- zBX is a System z machine type for integrated fulfillment, maintenance, and support

#### Secure network connection between zBX and z196 for data and support.

- Fast 10 Gb Ethernet connection to the data
- Lower latency: fewer 'hops' to get to the data, and no need for encryption or a firewall
- Traffic on user networks is not affected.
- Sharing of resources: up to eight z196 servers can attach to the zBX and have access to its solutions
- Configuration, support, monitoring, management - all by Unified Resource Manager



1. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.



#### Putting zEnterprise System to the task

Use the smarter solution to improve your application design



### z196 – Helping to control energy consumption in the data center

- Better control of energy usage and improved efficiency in your data center
- New water cooled option allows for energy savings without compromising performance
  - The maximum capacity server is 60% more power efficient than the System z10, or 70% more with the water cooled option
- Savings achieved on input power with optional High Voltage DC by removing the need for an additional DC to AC inversion step in the data center
- Improve flexibility with overhead cabling option while helping to increase air flow in a raised floor environment
- z196 has the same footprint as the System z10 EC<sup>1</sup>



1. With the exception of water cooling and overhead cabling




### z196 capacity per watt improvements



| 15 years of CMOS: G2 to z196 \* |              | Net effect: G2 to z196 \*         |       |
|---------------------------------|--------------|-----------------------------------|-------|
| Power increase:                 | 17% per year | Performance increased by:         | ~300x |
| Performance increase:           | 46% per year | Performance / kWatt increased by: | ~30x  |
| Power density increase:         | 13% per year | Performance / sq ft increased by: | ~190x |

Note: Capacity/kWatt assumes hot room, max plugged I/O power, max memory power and all engines turned on. Real world max capacity system is about 3/4 of this.
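
The yearly rates and the net-effect figures in the table can be tied together by compounding over 15 years. This is a rough check, assuming simple annual compounding and reading "power density" as power per unit of floor area; the variable names are invented for the example, and the slide's inputs are rounded.

```python
YEARS = 15

performance_growth = 1.46 ** YEARS       # 46% per year
power_growth = 1.17 ** YEARS             # 17% per year
power_density_growth = 1.13 ** YEARS     # 13% per year

performance_per_watt = performance_growth / power_growth
floor_area_growth = power_growth / power_density_growth
performance_per_area = performance_growth / floor_area_growth

print(round(performance_growth), round(performance_per_watt),
      round(performance_per_area))
```

Compounding gives roughly 290x for performance and roughly 28x for performance per watt, in line with the slide's ~300x and ~30x. The floor-area figure comes out somewhat below the slide's ~190x under these assumptions, which is plausible given the rounded yearly rates.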


## Summary

- There is a lot of hardware/firmware complexity under the hood for:
  - Performance
  - Reliability
  - "But, we worry about the details, so you don't have to."
- Instruction Set Architecture continues to evolve
  - Close collaboration with software to optimize performance and functionality
- zBX opens up a new dimension in System z
  - Will likely continue this trend with more accelerator functions
- Energy efficiency will continue to improve



## Thank you!

- Feel free to contact me offline with processor hardware questions on IBM System z performance, functionality, etc.
- e-mail: slegel@us.ibm.com