## **IBM System z13 Overview**

Dr. Fadi Busaba <u>busaba@us.ibm.com</u> Adam Collura <u>collura@us.ibm.com</u>





#### The following are trademarks of the International Business Machines Corporation in the United States, other countries, or both.

Not all common law marks used by IBM are listed on this page. Failure of a mark to appear does not mean that IBM does not use the mark nor does it mean that the product is not actively marketed or is not significant within its relevant market.

Those trademarks followed by ® are registered trademarks of IBM in the United States; all others are trademarks or common law marks of IBM in the United States.

For a more complete list of IBM Trademarks, see www.ibm.com/legal/copytrade.shtml:

\*BladeCenter®, CICS®, DataPower®, DB2®, e business(logo)®, ESCON, eServer, FICON®, IBM®, IBM (logo)®, IMS, MVS, OS/390®, POWER6®, POWER6+, POWER7®, Power Architecture®, PowerVM®, PureFlex, PureSystems, S/390®, ServerProven®, Sysplex Timer®, System p®, System p5, System x®, z Systems®, System z9®, System z10®, WebSphere®, X-Architecture®, z13™, z Systems™, z9®, z10, z/Architecture®, z/OS®, z/VM®, z/VSE®, zEnterprise®, zSeries®

#### The following are trademarks or registered trademarks of other companies.

Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom.

Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office.

IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency, which is now part of the Office of Government Commerce.

#### \* All other products may be trademarks or registered trademarks of their respective companies.

Notes:

Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here.

IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.

All customer examples cited or described in this presentation are presented as illustrations of the manner in which some customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics will vary depending on individual customer configurations and conditions.

This publication was produced in the United States. IBM may not offer the products, services or features discussed in this document in other countries, and the information may be subject to change without notice. Consult your local IBM business contact for information on the product or services available in your area.

All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

Information about non-IBM products is obtained from the manufacturers of those products or their published announcements. IBM has not tested those products and cannot confirm the performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. Prices subject to change without notice. Contact your IBM representative or Business Partner for the most current pricing in your geography.

### Glossary

| Term | Definition |
|------|------------|
| ASIC | Application-specific integrated circuit |
| BPH | Bulk Power Hub |
| CCA | Common Cryptographic Architecture - IBM software that enables a consistent approach to cryptography on major IBM computing platforms |
| CPC Drawer | CPC drawer refers to the packaging of the PU and SC SCMs, Memory and PCIe Gen3, ICA-SR and IFB fanouts |
| CS5 | Coupling Short Reach Generation 5 - CHPID type on z13 for ICA-SR short reach coupling links |
| FPGA | Field-programmable gate array |
| IBM zAware | IBM z Advanced Workload Analysis Reporter. Provides near real-time detection of anomalous situations in the system, based on the system's past behavior and continuous monitoring |
| ICA SR | Integrated Coupling Adapter |
| I/O Drawer | I/O drawer connected to InfiniBand fanouts supporting the 6 GBps InfiniBand I/O interconnect. For z13, FICON Express8 is the only I/O feature supported in this drawer |
| KVM | Kernel-based Virtual Machine - Open source software providing a full virtualization solution for Linux |
| Node | A node can be a z13 CPC and/or a standalone zBX Model 004 in an ensemble. For prior generation systems it is a zEC12, zBC12, z196 or z114 and any optionally attached zBX. A node can be a member of only one ensemble |
| PCIe I/O Drawer | PCIe I/O drawer connected to PCI Express Generation 2 (PCIe Gen2) 8 GBps I/O interconnect infrastructure introduced with z196/z114 or PCI Express Generation 3 (PCIe Gen3) 16 GBps PCIe I/O interconnect infrastructure introduced with z13 |
| RAIM | Redundant array of independent memory. A technology introduced with z196 designed to provide protection at the dynamic random access memory (DRAM), dual inline memory module (DIMM), and memory channel level |
| RDMA | Remote direct memory access |
| RG | Resource Group |
| RoCE | RDMA over Converged Enhanced Ethernet |
| SCH | System Control Hub |
| SCM | Single Chip Module. For z13, these can be either the Processor Unit (PU) or System Controller (SC) modules |
| SIMD | Single Instruction Multiple Data - Vector processing model providing instruction level parallelism; benefits workloads such as analytics and mathematical modeling |
| SLC | Separately licensed code. Internal zBX code that is licensed separately from the zBX's LIC |
| SMT | Simultaneous multithreading - Architectural concept in which a core, when multithreading is enabled, comprises a group of CPUs (sometimes called threads) |
| SMC-R | Shared Memory Communications - Remote Direct Memory Access |
| zEDC | zEDC Express - Hardware feature for z13, zEC12 and zBC12. Integrated solution with software capability of zEDC in z/OS V2.1 for compression acceleration |

### System z: Integrated by design





### **IBM z Systems High End Generations**

| | z9 Enterprise Class (N-4) | z10 Enterprise Class (N-3) | zEnterprise 196 (N-2) | zEnterprise EC12 (N-1) | IBM z13 (N) |
|---|---|---|---|---|---|
| Announced | 7/2005 | 2/2008 | 7/22/2010 | 8/28/2012 | 1Q2015 |
| Withdrawn | 6/30/2010 | 6/30/2012 | 6/30/2014 | | |
| Chip | 2 core, 1.7 GHz | 4 core, 4.4 GHz | 4 core, 5.2 GHz | 6 core, 5.5 GHz | 8 core, 5.0 GHz |
| Max client cores | 54 | 64 | 80 | 101 | 141 |
| Engine types | CP, IFL, ICF, zAAP, zIIP | CP, IFL, ICF, zAAP, zIIP | CP, IFL, ICF, zAAP, zIIP | CP, IFL, ICF, zAAP, zIIP | CP, IFL, ICF, zIIP |
| Threading | Single thread | Single thread | Single thread | Single thread | SMT: zIIP, IFL |
| Specialty engine to CP ratio | zIIP-zAAP 1x1 | zIIP-zAAP 1x1 | zIIP-zAAP 1x1 | zIIP-zAAP 2x1 | zIIP 2x1 |
| Uni MIPS | 560 | 902 | 1,202 | 1,514 | 1,695 |
| Max MIPS | 18,505 | 31,826 | 52,286 | 78,426 | 111,556 |
| Max memory | 512 GB - HSA | 1.5 TB | 3 TB (RAIM) | 3 TB (RAIM) | 10 TB (RAIM) |
| Max per LPAR | 512 GB - HSA | 1 TB | 1 TB | 1 TB | 10 TB |
| LCSS | 4 | 4 | 4 | 4 | 6 |
| LPARs | 60 | 60 | 60 | 60 | 85 |
| Subchannel sets per LCSS | 2 | 2 | 3 | 3 | 4 |
| Max I/O slots | 84 | 84 | 160\* | 160\* | 160\* |
| Max FICON channels | 336 | 336 | 320 | 320 | 320 |
| FICON feature | FICON Express4 (GA2) | FICON Express4 | FICON Express8S (GA2) | FICON Express8S | FICON Express16S |
| Max OSA ports | 48 | 96 | 96 | 96 | 96 |
| OSA feature | OSA-Express2 | OSA-Express3 | OSA-Express4S (GA2) | OSA-Express5S (GA2) | OSA-Express5S |
| Crypto feature | Crypto Express2 | Crypto Express3 (GA3) | Crypto Express3 | Crypto Express4S | Crypto Express5S |
| Coupling | ISC3, IFB, PSIFB: 12x SDR | ISC3, IFB, PSIFB: 12x DDR, 1x DDR | ISC3, PSIFB: 12x DDR, 1x DDR | PSIFB: 12x DDR, 1x DDR | PSIFB: 12x DDR, 1x DDR |
| ASHRAE class | | A1 | A1 | A1 | A2 |
| PCIe | | | | | Gen3 16 GBps |
| Native PCIe | | | | zEDC, Flash Express, 10 GbE RoCE | zEDC, Flash Express, 10 GbE RoCE with SR-IOV |



## **IBM z13 platform positioning**



- The world's premier transaction and data engine now enabled for the mobile generation
- The integrated transaction and analytics system for right-time insights at the point of impact
- The world's most efficient and trusted cloud system that transforms the economics of IT

### **IBM z Systems**

An integrated, highly scalable computer system that handles many different pieces of work at the same time, shares information among them as needed with protection, and securely processes very large amounts of information for many users, without users experiencing any failures in service.



- Large scale, robust consolidation platform
- Built-in Virtualization
- Hundreds to thousands of virtual servers on z/VM
- Intelligent and autonomic management of diverse workloads and system resources

<sup>\*</sup>zAAPs not available on z13

### z13 Continues the CMOS Mainframe Heritage Begun in 1994



\* MIPS Tables are NOT adequate for making comparisons of z Systems processors. Additional capacity planning required

\*\* Number of PU cores for customer use



### z13 System Design Changes

- 22nm Processor with SIMD, SMT
- Integrated I/O with PCIe Direct Attach – 16 GBps
- Single Chip Modules
- Drawer-Based CPC Design
- Cable-Based SMP Fabric
- Oscillator Backplane
- Flexible Service Processor (FSP2)
- Integrated Sparing
- On-chip power/thermal monitor / control



- New Memory Controller
- Crypto Express5S
- FICON Express16S
- 1U Support Element
- Standalone zBX Node Hybrid Computing
- 2.7M lines of firmware changed
- Radiator Design improvements
- Expanded operating environment (Rear Doors)

[Figure: chip floorplan showing Core0 through Core7 and the L3 cache regions]




© 2015 IBM Corporation



### **IBM z13: The New Possible**





## **z13 Details**

### z13 z/Architecture / Micro-architecture Enhancements

- Core micro-architecture radically altered to increase parallelism and to improve instruction execution.
- Simultaneous multithreading (SMT) operation
- Single Instruction Multiple Data (SIMD) instruction set and execution: Business Analytics Vector Processing

Single-thread performance equation: Execution time = Code length × Clock cycles per instruction × Cycle time
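The equation can be made concrete with a small calculation. The sketch below is illustrative only: the instruction count and the cycles-per-instruction (CPI) values are hypothetical placeholders, not measured z13 or zEC12 figures. Only the 5.0 GHz and 5.5 GHz clock rates are taken from this presentation. It shows that a core with a slower clock can still finish the same code sooner if its CPI drops far enough.

```python
def execution_time_seconds(instruction_count: int, cpi: float, clock_hz: float) -> float:
    """Execution time = code length * cycles per instruction * cycle time."""
    cycle_time = 1.0 / clock_hz  # seconds per clock cycle
    return instruction_count * cpi * cycle_time


def break_even_cpi(cpi_old: float, clock_old_hz: float, clock_new_hz: float) -> float:
    """CPI the new core needs in order to match the old core on identical code."""
    return cpi_old * clock_new_hz / clock_old_hz


# Hypothetical workload size (an assumption, not from the presentation).
INSTRUCTIONS = 1_000_000_000

# Clock rates come from the presentation; both CPI values are assumed.
old_time = execution_time_seconds(INSTRUCTIONS, cpi=2.0, clock_hz=5.5e9)
new_time = execution_time_seconds(INSTRUCTIONS, cpi=1.6, clock_hz=5.0e9)

print(f"5.5 GHz, CPI 2.0: {old_time:.4f} s")
print(f"5.0 GHz, CPI 1.6: {new_time:.4f} s")
print(f"Break-even CPI at 5.0 GHz: {break_even_cpi(2.0, 5.5e9, 5.0e9):.3f}")
```

With these assumed numbers, any CPI below the break-even value at 5.0 GHz yields a shorter execution time than the 5.5 GHz baseline, which is the trade-off the micro-architecture changes listed above are aiming at.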

### z13 8-Core Processor Unit (PU) Chip Detail



- 14S0 22nm SOI Technology
  - 17 layers of metal
  - 3.99 billion transistors
  - 13.7 miles of copper wire
- Chip area
  - 678.8 mm²
  - 28.4 x 23.9 mm
  - 17,773 power pins
  - 1,603 signal I/Os

- Up to eight active cores (PUs) per chip
  - 5.0 GHz (vs. 5.5 GHz on zEC12)
  - L1 cache per core
    - 96 KB I-cache
    - 128 KB D-cache
  - L2 cache per core
    - 2 MB + 2 MB eDRAM split private L2 cache
- Single Instruction/Multiple Data (SIMD)
- Single thread or 2-way simultaneous multithreading (SMT) operation
- Improved instruction execution bandwidth:
  - Greatly improved branch prediction and instruction fetch to support SMT
  - Instruction decode, dispatch, complete increased to 6 instructions per cycle
  - Issue up to 10 instructions per cycle
  - Integer and floating point execution units
- On-chip 64 MB eDRAM L3 cache
  - Shared by all cores
- I/O buses
  - One InfiniBand I/O bus
  - Two PCIe I/O buses
- Memory Controller (MCU)
  - Interface to controller on memory DIMMs
  - Supports RAIM design
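As a quick check on the cache figures listed above, the following sketch adds up the raw on-chip cache capacity of one PU chip with all eight cores active. It assumes binary units (1 KB = 1024 bytes), which the slide does not state, and the names are illustrative. The total is raw capacity only; it does not account for any overlap between cache levels.

```python
KB = 1024          # assumption: binary kilobytes
MB = 1024 * KB

CORES_PER_CHIP = 8                # "Up to eight active cores (PUs) per chip"
L1_I_PER_CORE = 96 * KB           # L1 instruction cache per core
L1_D_PER_CORE = 128 * KB          # L1 data cache per core
L2_PER_CORE = 2 * MB + 2 * MB     # split private L2 per core
L3_SHARED = 64 * MB               # on-chip L3, shared by all cores


def chip_cache_totals(cores: int = CORES_PER_CHIP) -> dict:
    """Raw cache capacity per level for one PU chip, in bytes."""
    l1 = cores * (L1_I_PER_CORE + L1_D_PER_CORE)
    l2 = cores * L2_PER_CORE
    return {"L1": l1, "L2": l2, "L3": L3_SHARED, "total": l1 + l2 + L3_SHARED}


totals = chip_cache_totals()
for level, size in totals.items():
    print(f"{level}: {size / MB:.2f} MB")
```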



### **z13 Drawer Structure and Interconnect**



- S-bus: SC to SC chip in the same drawer
- A-bus: SC to SC chips in the remote drawers

### z System Cache Topology – zEC12 vs. z13 Comparison

| 4 L4 Caches                                                                            | 8 L4 Caches                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|----------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                                                                                        |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| 384MB                                                                                  | NIC 480MB                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| Shared eDRAM L4                                                                        | Directory Shared eDRAM L4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|                                                                                        |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| 48MB Shr<br>eDRAM L3 6 L3s, eDRAM L3                                                   | Intra-node Interface                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| I I I I I I I 36 L1 / L2s I I I I I<br>L2 L2 L2 L2 L2 L2 L2<br>L1 L1 L1 L1 L1 L1 L1 L1 | 2 64MB Shr 64MB Shr                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|                                                                                        | eDRAM L3     3 L3s and 24 L1 / L2s     eDRAM L3       L2     L2 <td< td=""></td<> |
| L1: 64KI + 96KD<br>6w DL1, 4w IL1<br>256B line size                                    | L1: 96KI + 128KD<br>8w DL1, 6w IL1<br>256B line size                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| L2 Private 1MB Inclusive of DL1<br>Private 1MB Inclusive of IL1                        | L2 Private 2MB Inclusive of DL1<br>Private 2MB Inclusive of IL1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| L3 Shared 48MB Inclusive of L2s<br>12w Set Associative<br>256B cache line size         | L3 Shared 64MB Inclusive of L2s<br>16w Set Associative<br>256B cache line size                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| L4 384MB Inclusive<br>24w Set Associative<br>256B cache line size | L4 480MB + 224MB NonData Inclusive Coherent Directory<br>30w Set Associative<br>256B cache line size |
| zEC12 (Per Book) | z13 (half of CPC drawer node) |

© 2015 IBM Corporation

### **z13 Processor Overview**

- 2X Instruction pipe width
  - Improves IPC for all modes
  - Symmetry simplifies dispatch/issue rules
  - Required for effective SMT
- Added FXU and BFU execution units
  - 4 FXUs
  - 2 BFUs, DFUs
  - 2 new SIMD units
- SIMD unit plus additional registers
- Pipe depth re-optimized for power/performance
  - Product frequency reduced
  - Processor performance increased
- SMT support
  - Wide, symmetric pipeline
  - Full architected state per thread
  - SMT-adjusted CPU usage metering



### PU Chip Floorplan



PU Core Floorplan




The z13 high-level instruction and execution flow.




The z13 Microprocessor pipeline and SMT operation. Snapshot showing simultaneous execution of instructions from thread 0 and thread 1 in pipeline stages.



BTB2 increased from 4k 6-way to 16k 6-way.



Private L1 and L2 caches connected to shared L3 cache.

## Simultaneous Multithreading (SMT)

### **Simultaneous Multithreading - Background**

#### SMT enables multiple threads to run on a single core

- Other processor families (e.g., x86) already have similar support
- Each thread runs more slowly than it would alone on a non-SMT core, but the combined throughput of the threads is higher. The overall throughput benefit depends on the workload.

### Hardware support

- Single thread (ST) operation
- SMT operation with seamless transition between ST and SMT
- Precise metering of SMT utilization => Monitors Dashboard
- Software must explicitly enable SMT operation
  - Software must be at a level that can exploit SMT
  - Use of SMT is on a per-LPAR basis
  - Hardware support being present is not sufficient
    - The OS must issue instructions to switch into SMT mode
  - The SMT switch is unidirectional
    - Once the OS switches, the only way back to ST mode is a disruptive action (reactivating or re-IPLing the partition)

## Simultaneous Multithreading (SMT)

- Simultaneous multithreading allows instructions from one or two threads to execute on an Integrated Facility for Linux (IFL) or an IBM z Integrated Information Processor (zIIP) processor core.
- SMT helps to address memory latency, resulting in an overall capacity\* (throughput) improvement per core
- SMT can be turned on or off on an LPAR by LPAR basis by operating system parameters. z/OS can also do this dynamically with operator commands.



Which approach is designed for the highest volume\*\* of traffic? Which road is faster?

\*\* Two lanes at 50 carry 25% more volume if traffic density per lane is equal.
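The footnote's arithmetic can be checked with a short sketch. The single-lane speed is not stated in the text; the code assumes 80, which is the value that makes the 25% figure work out, and it assumes volume scales as lanes times speed at equal density per lane.

```python
def relative_volume(lanes: int, speed: float) -> float:
    """Traffic volume in arbitrary units, assuming equal density per lane."""
    return lanes * speed


# Assumed figures: one fast lane at 80 versus two slower lanes at 50.
single = relative_volume(lanes=1, speed=80)
dual = relative_volume(lanes=2, speed=50)
gain = dual / single - 1.0

print(f"two lanes at 50 carry {gain:.0%} more volume than one lane at 80")
```

The mapping to SMT is that each thread (lane) is slower than a dedicated single-thread core, while the core as a whole moves more work.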



### **Simultaneous Multithreading – The Technology**

- Simultaneous Multithreading (SMT) technology
  - Multiple programs (software threads) run on the same processor core
  - More efficient use of the core hardware
- Active threads share core resources
  - In space: data and instruction caches, TLBs, branch history tables, etc.
  - In time: pipeline slots, execution units, address translator, etc.
- Increases overall throughput per core when SMT is active
  - The size of the increase varies widely with workload, typically a factor of 1.X to 1.Y (greater than 1)
  - Each thread runs more slowly than on a single-thread core





## z13 - Simultaneous Multithreading (SMT)

#### z13 is the first z System Processor to support SMT

- Enable continued scaling of per-processor capacity
- z13 supports 2 threads per core on IFLs and zIIPs *only*
- Increases per-core and system throughput versus single thread design
  - More work done per unit hardware
  - Aligns with industry direction of multi-thread
  - Improves per-core performance comparisons vs. X86, POWER
  - Improves efficiency of IFL for Linux consolidation
- Designed to preserve unique z System values and attributes
  - Full support for 2-level processor virtualization
  - Full z/Architecture capability for each thread
- Design will allow independent enablement of SMT by LPAR
  - Operating systems must be explicitly enabled for SMT
  - Operating system may opt to run in single-thread mode
- Processors can run in single-thread operation for workloads needing maximum thread speed
- Functionally transparent to middleware and applications
  - No changes required to run in SMT partition
- Operating System/Hypervisor Support
  - z/OS (for zIIPs) at GA
  - z/VM (for IFLs) at GA
  - Linux: IBM is working with its Linux Distribution partners to support new functions/features

### **SMT Support Implementation**

### CPU address expansion

- Without SMT
  - CPU x0014 = 0000 0000 0001 0100
- With SMT
  - Core x0014 thread 0 = 0000 0000 0010 1000 (CPU x0028)
  - Core x0014 thread 1 = 0000 0000 0010 1001 (CPU x0029)
- Non-IFL processor odd address unavailable or unused
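The address expansion above can be sketched as a shift and an OR. This assumes the SMT address is the core number shifted left by one bit with the thread number in the low-order bit, which matches the values listed; the function names are illustrative only.

```python
def cpu_address(core_id: int, thread_id: int) -> int:
    """Build a 16-bit CPU address from a core number and a thread number."""
    if not 0 <= core_id < (1 << 15):
        raise ValueError("core ID must fit in 15 bits")
    if thread_id not in (0, 1):
        raise ValueError("thread ID must be 0 or 1")
    return (core_id << 1) | thread_id


def as_bits(value: int) -> str:
    """Format a 16-bit value as four groups of four bits."""
    bits = f"{value:016b}"
    return " ".join(bits[i:i + 4] for i in range(0, 16, 4))


for thread in (0, 1):
    addr = cpu_address(0x0014, thread)
    print(f"Core x0014 thread {thread} = {as_bits(addr)} (CPU x{addr:04X})")
```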





### **z13 Core Virtualization**

- CPU Address changes with SMT
  - Sixteen bit CPU Id consists of a fifteen bit Core ID and one bit Thread ID



- CPU ID 6 (b'0000000000000110') means core 3, thread 0
- CPU ID 7 (b'0000000000000111') means core 3, thread 1
- On z13, z/OS will support SMT for zIIPs and z/VM will support SMT for IFLs
- For CPs, only thread 0 is usable on each core
- SMT-aware hypervisors (z/VM) or operating systems (z/OS) must opt in at IPL to exploit SMT for the life of the IPL
  - Hardware makes both threads usable on each core
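Going the other way, a CPU ID can be split back into its core and thread parts. The sketch below is a minimal illustration of the layout described on this slide; the `processor_type` labels are illustrative strings, and SMT is assumed to be enabled for the zIIP and IFL cases.

```python
from typing import NamedTuple


class CpuId(NamedTuple):
    core: int
    thread: int


def decode_cpu_id(cpu_id: int) -> CpuId:
    """Split a 16-bit CPU ID into its 15-bit core ID and 1-bit thread ID."""
    if not 0 <= cpu_id <= 0xFFFF:
        raise ValueError("CPU ID must fit in 16 bits")
    return CpuId(core=cpu_id >> 1, thread=cpu_id & 1)


def is_usable(cpu_id: int, processor_type: str) -> bool:
    """Apply the slide's rule that CPs expose only thread 0 of each core."""
    thread = decode_cpu_id(cpu_id).thread
    if processor_type == "CP":
        return thread == 0
    # Assumption: SMT has been enabled, so both threads are usable.
    return processor_type in ("zIIP", "IFL")


for cpu_id in (6, 7):
    core, thread = decode_cpu_id(cpu_id)
    print(f"CPU ID {cpu_id}: core {core}, thread {thread}, "
          f"usable on a CP: {is_usable(cpu_id, 'CP')}")
```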



### z System SMT Exploitation



### SMT Aware OS informs PR/SM that it intends to exploit SMT

- PR/SM can dispatch any OS core to any physical core
- An OS that controls the whole core must follow rules
  - Maximize core throughput (drive cores with high thread density [2])
  - Maximize core availability (meet workload goals using the fewest cores)

### SMT is transparent to applications


## Standardized virtualization for z System

SOD at announcement for KVM optimized for z System

- Expanded audience for Linux on z Systems
  - KVM on z System will co-exist with z/VM
  - Attracting new clients with in house KVM skills
  - Simplified startup with standard KVM interfaces
- Support of modernized open source KVM hypervisor for Linux
  - Provisioning, mobility, memory over-commit
  - Standard management and operational controls
  - Simplicity and familiarity for Intel Linux users
- Optimized for z System scalability, performance, security and resiliency
  - Standard software distribution from IBM
- Flexible *integration to cloud* offerings
  - Standard use of storage and networking drivers (including SCSI disk)
  - No proprietary agent management
  - Off-the-shelf OpenStack and cloud drivers
  - Standard enterprise monitoring and automation (e.g., GDPS)

All statements regarding IBM's plans, directions, and intent are subject to change or withdrawal without notice. Any reliance on these Statements of General Direction is at the relying party's sole risk and will not create liability or obligation for IBM.





# **Single Instruction Multiple Data (SIMD)**

IBM z Systems z13 Vector Extension Facility (SIMD): https://share.confex.com/share/124/webprogram/Session16897.html

### SIMD (Single Instruction Multiple Data) processing

### Increased parallelism to enable analytics processing

- Smaller amount of code helps improve execution efficiency
- Process elements in parallel enabling more iterations
- Supports analytics, compression, cryptography, video/imaging processing

Value:

- Offload CPU
- Simplify coding
- Enable new applications

### SIMD (Single Instruction Multiple Data) Processing Example



- (Significantly) smaller amount of code => improved execution efficiency
- Number of elements processed in parallel = (size of SIMD / size of element)
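The element-count formula can be worked through for the usual element sizes. This sketch assumes 128-bit vector registers (the width of the initial z13 implementation); the helper names are illustrative.

```python
VECTOR_BITS = 128  # assumed register width

ELEMENT_BITS = {"byte": 8, "halfword": 16, "word": 32, "doubleword": 64}


def elements_per_vector(element_bits: int, vector_bits: int = VECTOR_BITS) -> int:
    """Number of elements processed in parallel = size of SIMD / size of element."""
    if element_bits <= 0 or vector_bits % element_bits != 0:
        raise ValueError("element size must evenly divide the vector size")
    return vector_bits // element_bits


def vector_iterations(n_elements: int, element_bits: int) -> int:
    """Loop trips needed to cover n_elements when processing one vector per trip."""
    lanes = elements_per_vector(element_bits)
    return -(-n_elements // lanes)  # ceiling division


for name, bits in ELEMENT_BITS.items():
    print(f"{name:>10}: {elements_per_vector(bits):2d} elements per vector")
```

For example, a loop over 100 word-sized elements needs 25 vector iterations instead of 100 scalar ones, which is where the smaller code path and higher efficiency come from.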



### **Overlaid Vector/FPR register files**

- Initial implementation: 32 x 128b Vector Registers
  - Both dimensions may grow in future
- Vector register file overlays the FPRs
  - FPRs 0-15 == Bits 0:63 of SIMD regs 0-15
  - Update to FPR &lt;x&gt; alters entire SIMD register &lt;x&gt;
- Why overlay?
  - Saves hardware area / power
  - Easier mixing of scalar / SIMD code
    - Less copying of values between registers
  - Effectively get 64 FPRs
    - Can improve FP code efficiency
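The overlay can be pictured by modeling a vector register as a 128-bit integer. This is only a sketch: it uses IBM bit numbering (bit 0 is the leftmost, most significant bit), assumes the floating-point data is in IEEE binary64 format for illustration, and models reads only, since the text does not say what an FPR update leaves in the rest of the register.

```python
import struct

MASK64 = (1 << 64) - 1


def pack_doubles(left: float, right: float) -> int:
    """Place two binary64 values in one 128-bit register image (left = bits 0:63)."""
    return int.from_bytes(struct.pack(">dd", left, right), "big")


def fpr_view(vr: int) -> float:
    """Read the overlaid FPR: bits 0:63 of the vector register."""
    high = (vr >> 64) & MASK64
    return struct.unpack(">d", high.to_bytes(8, "big"))[0]


def right_half_view(vr: int) -> float:
    """Read bits 64:127, which are reachable only through vector instructions."""
    return struct.unpack(">d", (vr & MASK64).to_bytes(8, "big"))[0]


vr = pack_doubles(1.5, -2.0)
print(fpr_view(vr), right_half_view(vr))
```

Because each 128-bit register holds two 64-bit values, 32 vector registers give room for 64 double-precision values, which is one way to read the "effectively get 64 FPRs" point.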





## SIMD Exploitation

- Provide optimized SIMD math & linear algebra libraries that will minimize the effort on the part of middleware/application developers
- Provide compiler built-in functions for SIMD that software applications can leverage as needed (e.g. for use of string instructions)
- String Millicode Instructions (Translate, Compare Logical String, Compare Until Substring Equal)
- Java.Next
  - Accelerate string, converter, and array operations
  - Idiomatic auto-vectorization (e.g., simple loops)

| Workloads                          |                                                                      |                                                        |
|------------------------------------|----------------------------------------------------------------------|--------------------------------------------------------|
| Java.Next                          | C/C++ compiler built-ins<br>for SIMD operations<br>(z/OS and zLinux) | MASS & ATLAS<br>math libraries<br>(z/OS and zLinux)    |
| SIMD Registers and Instruction Set |                                                                      |                                                        |

<sup>1</sup> All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.



# **SIMD Instructions**

- SUPPORT Loads, Stores, Moves
- INTEGER ARITHMETIC
- FLOATING-POINT ARITHMETIC
- STRING ACCELERATION



- VECTOR LOAD
  - VL VR<sub>1</sub>,D<sub>2</sub>(X<sub>2</sub>,B<sub>2</sub>)
  - Loads 16 bytes from storage into VR<sub>1</sub>. **No alignment requirement**
- VECTOR LOAD AND REPLICATE
  - VLREP(B|H|F|G) VR<sub>1</sub>,D<sub>2</sub>(X<sub>2</sub>,B<sub>2</sub>),M<sub>3</sub>
  - Loads 1-8 bytes and replicates them across all elements of VR<sub>1</sub>
- VECTOR LOAD ELEMENT
  - VLE(B|H|F|G) VR<sub>1</sub>,D<sub>2</sub>(X<sub>2</sub>,B<sub>2</sub>),M<sub>3</sub>
  - The element-sized second operand is placed into VR<sub>1</sub> at index M<sub>3</sub>
- VECTOR LOAD ELEMENT IMMEDIATE
  - VLEI(B|H|F|G) VR<sub>1</sub>,I<sub>2</sub>,M<sub>3</sub>
  - Places I<sub>2</sub> in VR<sub>1</sub> at index M<sub>3</sub>, leaves rest of vector unchanged
- VECTOR LOAD MULTIPLE
  - VLM VR<sub>1</sub>,VR<sub>3</sub>,D<sub>2</sub>(B<sub>2</sub>),M<sub>4</sub>
  - Up to 16 VRs loaded from storage
- VECTOR LOAD TO BLOCK BOUNDARY
  - VLBB VR<sub>1</sub>,D<sub>2</sub>(X<sub>2</sub>,B<sub>2</sub>),M<sub>3</sub>
  - Loads up to 16 bytes into VR<sub>1</sub> without crossing the block boundary specified by M<sub>3</sub>
- LOAD COUNT TO BLOCK BOUNDARY
  - LCBB R<sub>1</sub>,D<sub>2</sub>(X<sub>2</sub>,B<sub>2</sub>),M<sub>3</sub>
  - Loads R<sub>1</sub> with the number of bytes that can be loaded with the specified block size
- VECTOR LOAD WITH LENGTH
  - VLL VR<sub>1</sub>,D<sub>2</sub>(B<sub>2</sub>),R<sub>3</sub>
  - Loads the number of bytes specified in R<sub>3</sub> from storage into VR<sub>1</sub>
- VECTOR LOAD LOGICAL ELEMENT AND ZERO
  - VLLEZ(B|H|F|G) VR<sub>1</sub>,D<sub>2</sub>(X<sub>2</sub>,B<sub>2</sub>),M<sub>3</sub>
  - Loads element-sized data from the second operand address and places it right justified in the leftmost DW
- VECTOR GATHER ELEMENT
  - VGEF(VGEG) VR<sub>1</sub>,D<sub>2</sub>(V<sub>2</sub>,B<sub>2</sub>),M<sub>3</sub>
  - Loads element from memory addressed by B<sub>2</sub>+V<sub>2</sub>(M<sub>3</sub>)+D<sub>2</sub>
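The two block-boundary instructions are easiest to see with a small emulation. The sketch below is a software model, not the hardware definition: it takes the block size in bytes directly (rather than the M<sub>3</sub> encoding, which is not given here) and assumes the count is capped at the 16-byte vector length.

```python
VECTOR_BYTES = 16  # one 128-bit vector register


def load_count_to_block_boundary(address: int, block_size: int) -> int:
    """Bytes loadable from address without crossing the next block boundary."""
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block size must be a power of two")
    to_boundary = block_size - (address % block_size)
    return min(VECTOR_BYTES, to_boundary)


def vector_load_to_block_boundary(memory: bytes, address: int, block_size: int) -> bytes:
    """Return only the bytes that lie before the block boundary (at most 16)."""
    count = load_count_to_block_boundary(address, block_size)
    return memory[address:address + count]


memory = bytes(range(256)) * 32  # 8 KB of sample data
for addr in (0x0FFA, 0x1000):
    n = load_count_to_block_boundary(addr, 4096)
    print(f"address x{addr:04X}: {n} bytes before the next 4 KB boundary")
```

A string routine can use this pattern to read up to the end of a page without touching the following page, then continue with full 16-byte loads.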






## **Vector Store Instructions**

- VECTOR STORE
  - VST VR<sub>1</sub>,D<sub>2</sub>(X<sub>2</sub>,B<sub>2</sub>)
  - Stores 16 bytes on byte boundary, no alignment required
- VECTOR STORE ELEMENT
  - VSTE(B|H|F|G) VR<sub>1</sub>,D<sub>2</sub>(X<sub>2</sub>,B<sub>2</sub>),M<sub>3</sub>
  - Stores element of VR<sub>1</sub> indexed by M<sub>3</sub> to second operand
- VECTOR STORE MULTIPLE
  - VSTM VR<sub>1</sub>,VR<sub>3</sub>,D<sub>2</sub>(B<sub>2</sub>),M<sub>4</sub>
  - Stores range of up to 16 VRs to second operand location
- VECTOR STORE WITH LENGTH
  - VSTL VR<sub>1</sub>,D<sub>2</sub>(B<sub>2</sub>),R<sub>3</sub>
  - Stores the number of bytes specified by R<sub>3</sub> from VR<sub>1</sub> into the second operand location
- VECTOR SCATTER ELEMENT
  - VSCEF(VSCEG) VR<sub>1</sub>,D<sub>2</sub>(V<sub>2</sub>,B<sub>2</sub>),M<sub>3</sub>
  - Stores element of VR<sub>1</sub> indexed by M<sub>3</sub> to memory addressed by B<sub>2</sub>+V<sub>2</sub>(M<sub>3</sub>)+D<sub>2</sub>
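The element addressing used by scatter (and by the matching gather load) can be modeled in a few lines. This is a software sketch under stated assumptions: registers are plain Python lists, elements are 4-byte big-endian words, and the function names are illustrative rather than any real API.

```python
from typing import List, Sequence

ELEMENT_BYTES = 4  # assumed word-sized elements


def element_address(base: int, index_vector: Sequence[int], m3: int, displacement: int) -> int:
    """Effective address = base + index element selected by m3 + displacement."""
    return base + index_vector[m3] + displacement


def scatter_element(memory: bytearray, source: Sequence[int], base: int,
                    index_vector: Sequence[int], m3: int, displacement: int) -> None:
    """Store element m3 of the source register at the computed address."""
    addr = element_address(base, index_vector, m3, displacement)
    if addr < 0 or addr + ELEMENT_BYTES > len(memory):
        raise IndexError("address outside modeled memory")
    memory[addr:addr + ELEMENT_BYTES] = source[m3].to_bytes(ELEMENT_BYTES, "big")


def gather_element(memory: bytes, target: List[int], base: int,
                   index_vector: Sequence[int], m3: int, displacement: int) -> None:
    """Load one element from the computed address into element m3 of the target."""
    addr = element_address(base, index_vector, m3, displacement)
    if addr < 0 or addr + ELEMENT_BYTES > len(memory):
        raise IndexError("address outside modeled memory")
    target[m3] = int.from_bytes(memory[addr:addr + ELEMENT_BYTES], "big")


memory = bytearray(64)
values = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
offsets = [0, 8, 16, 24]

scatter_element(memory, values, base=4, index_vector=offsets, m3=2, displacement=4)
loaded = [0, 0, 0, 0]
gather_element(memory, loaded, base=4, index_vector=offsets, m3=2, displacement=4)
print(hex(loaded[2]))
```

Only the element selected by M<sub>3</sub> moves in each call; the other elements of the register are left unchanged.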



# z Systems Crypto



# Where is the Coprocessor located on the PU core?





## z13 Compression and Cryptography Accelerator

### Coprocessor dedicated to each core (was shared by two cores on z196)

- Independent compression engine
- Independent cryptographic engine
- Available to any processor type (CP, zIIP, IFL)
- Owning processor is busy when its coprocessor is busy
- Instructions available to any processor type

### Data compression/expansion engine

- Static dictionary compression and expansion

### CP Assist for Cryptographic Function

- Supported by z/OS, z/VM, z/VSE, z/TPF, and Linux on z Systems
- DES, TDES: Clear and Protected Key
- AES128, 192, 256: Clear and Protected Key
- SHA-1 (160 bit): Clear Key
- SHA-256, -384, -512: Clear Key
- PRNG: Clear Key
- DRNG: Clear Key
- CPACF FC 3863 (No Charge Export Control) is required to enable some functions and to support Crypto Express5S or Crypto Express4S

# **CPACF - <u>CP</u> Assist For <u>Cryptographic Functions</u>**



| Supported<br>Algorithms | Clear<br>Key | Protected<br>Key |
|-------------------------|--------------|------------------|
| DES, T-DES              | Y            | Y                |
| AES128                  | Y            | Y                |
| AES192                  | Y            | Y                |
| AES256                  | Y            | Y                |
| SHA-1                   | Y            | N/A              |
| SHA-256                 | Y            | N/A              |
| SHA-384                 | Y            | N/A              |
| SHA-512                 | Y            | N/A              |
| PRNG                    | Y            | N/A              |
| DRNG                    | Y            | N/A              |

- Provides a set of symmetric cryptographic functions and hashing functions for:
  - Data privacy and confidentiality
  - Data integrity
  - Random Number generation
  - Message Authentication
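As a minimal sketch of the data-integrity and message-authentication uses listed above, the following uses Python's standard `hashlib` and `hmac` modules with SHA-256. Nothing here is CPACF-specific: whether the hashing is actually carried out by the coprocessor depends on the platform and on how the underlying crypto library was built, which this code neither controls nor detects. The key and message are made-up sample values.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Integrity: a digest that changes if any bit of the data changes."""
    return hashlib.sha256(data).hexdigest()


def make_tag(key: bytes, message: bytes) -> str:
    """Authentication: a keyed digest (HMAC-SHA-256) over the message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_tag(key: bytes, message: bytes, tag: str) -> bool:
    """Constant-time comparison of the expected and supplied tags."""
    return hmac.compare_digest(make_tag(key, message), tag)


key = b"sample-clear-key"          # illustrative clear key
message = b"payment record 0001"   # illustrative message

tag = make_tag(key, message)
print(sha256_hex(b"abc"))
print(verify_tag(key, message, tag), verify_tag(key, message + b"!", tag))
```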

### Enhances the encryption/decryption performance of clear-key operations for

- SSL
- VPN
- Data storing applications
- Available on every Processor Unit
- Supported by z/OS, z/VM, z/VSE, z/TPF and Linux on z Systems
- Must be explicitly enabled using a no-charge enablement feature (#3863)
  - SHA algorithms are enabled with each server
- Protected key support for additional security of cryptographic keys
  - Crypto Express4S or Crypto Express5S required in CCA mode

## z13 CPACF

CP Assist for Cryptographic Function coprocessor redesigned from the ground up

### Enhanced performance over zEC12

- Does not include overhead for COP start/end and cache effects
- Enhanced performance for large blocks of data
  - AES: 2x throughput vs. zEC12
  - TDES: 2x throughput vs. zEC12
  - SHA: 3.5x throughput vs. zEC12



- Exploiters of the CPACF that benefit from the throughput improvements of z13's CPACF include:
  - DB2/IMS encryption tool
  - DB2® built in encryption
  - z/OS Communication Server: IPsec/IKE/AT-TLS
  - z/OS System SSL
  - z/OS Network Authentication Service (Kerberos)
  - DFDSS Volume encryption
  - z/OS Java SDK
  - z/OS Encryption Facility
  - Linux on z Systems: kernel, OpenSSL, openCryptoki, GSKit


## **Overview – HW Crypto support in z Systems**






## Crypto Express5S Standards supported

- DES/TDES with DES/TDES MAC/CMAC
- AES, AESKW, AES GMAC, AES GCM, AES XTS mode, CMAC
- MD5, SHA-1, SHA-2 (224,256,384,512), HMAC
- VISA Format Preserving Encryption (VFPE)
- RSA (512, 1024, 2048, 4096) -> Performance improvement
- ECDSA (192, 224, 256, 384, 521 Prime/NIST)
- ECDSA (160, 192, 224, 256, 320, 384, 512 BrainPool)
- ECDH (192, 224, 256, 384, 521 Prime/NIST)
- ECDH (160, 192, 224, 256, 320, 384, 512 BrainPool)
- Montgomery Modular Math Engine
- RNG (Random Number Generator)
- PNG (Prime Number Generator) -> NEW
- Clear Key Fast Path (Symmetric and Asymmetric)

## **IBM z13 – Taking Java Performance to the Next Level**

Continued aggressive investment in Java on Z

Significant set of new hardware features tailored and co-designed with Java

## Simultaneous Multi-Threading (SMT)

- 2x hardware threads/core for improved throughput
- Available on zIIPs and IFLs

## Single Instruction Multiple Data (SIMD)

- Vector processing unit
- Accelerates loops and string operations

## Cryptographic Function (CPACF)

Improved performance of crypto co-processors

## **New Instructions**

Up to **50%** improvement for generic applications

Up to **2X** improvement in throughput per core for security-enabled applications







## Accelerating using SIMD with IBM Java 8 and z13

## IBM z13 running Java 8 on z/OS Single Instruction Multiple Data (SIMD) vector engine exploitation

### java.lang.String exploitation

- compareTo
- compareToIgnoreCase
- contains
- contentEquals
- equals
- indexOf
- lastIndexOf
- regionMatches
- toLowerCase
- toUpperCase
- getBytes

## java.util.Arrays

equals (primitive types)

### String encoding converters

For ISO8859-1, ASCII, UTF8, and UTF16

- encode (char2byte)
- decode (byte2char)

## Auto-SIMD

Simple loops (e.g., matrix multiplication)

## Primitive operations are between 1.6x and 60x faster with IBM Java 8



