RAS Features of the Intel Xeon Scalable Processors on Lenovo ThinkSystem Servers
Article
Author: Randall Lundin
Updated: 31 Mar 2022
Form Number: LP1571
PDF size: 9 pages, 108 KB
Abstract
Server reliability, availability, and serviceability (RAS) are crucial for modern enterprise IT shops that deliver mission-critical applications and services, because every hour of system downtime caused by an application delivery failure can be extremely costly. Intel Xeon Scalable Processors running on ThinkSystem servers continue to be at the top of the industry with regard to RAS features. This article explains why RAS features matter on a server and lists the key RAS features of the latest ThinkSystem servers that Lenovo offers to customers looking to minimize downtime in their data center.
Introduction
Applications such as database, enterprise resource planning (ERP), customer relationship management (CRM), and business intelligence (BI) applications need to be available 24/7 on a wide-area or global basis, so any failure that interrupts them is costly. In addition, the likelihood of such failures increases statistically with the size of the servers, data, and memory required for these deployments.
While clustering and virtualization can help meet availability requirements, they are not adequate solutions for very large databases, BI, and high-end transactional systems. A failure affecting a single core business application can easily cost hundreds of thousands or even millions of dollars per hour. All this leads to a need for scalable and highly resilient servers that are well suited for critical business applications and large-scale consolidation.
Always On
Time is money. Even a few minutes of downtime can result in significant costs and cause internal business operations to come to a standstill. Downtime can also adversely impact a company’s relationship with its customers, business suppliers, and partners. A lack of reliability can damage a company’s reputation and result in lost business.
The growth of new applications has ratcheted database processing and business analytics to the top of the list for server workloads. These workloads demand continuous availability from the enterprise platforms on which they run.
"Always on" has become a global requirement and impacts many aspects of our lives:
- Maximize productivity - Manufacturers need to keep their production line up and running. System downtime should not interrupt it.
- Control access - Facility Security companies prevent external threats to organizations. Security application downtime shouldn't be an internal threat.
- Protect profit - Retailers have sales targets to meet day in, day out. Transaction system downtime shouldn’t get in the way.
- Protect lives - First Responders take care of emergencies 24 x 7 x 365. Application downtime shouldn’t be one of them.
- Ensure quality care and privacy - Healthcare Institutions need to access patient information and be HIPAA compliant all the time. System downtime shouldn’t compromise either one.
- Process transactions - Financial Services organizations manage thousands of transactions a second. Processing system downtime simply can’t happen.
The Cost of Downtime
The ITIC 2021 survey found that 99% of organizations say that a single hour of downtime costs over $100,000, 91% of respondents indicated that 60 minutes of downtime costs their business over $300,000, and a record 44% of enterprises report that one hour of downtime costs their firms $1 million to over $5 million.
Figure 1. Cost of hourly downtime in enterprises, 2020-2021
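To put the hourly figures in perspective, the following is a minimal sketch that converts an availability percentage into expected downtime hours and cost per year. The availability levels and the flat hourly cost model are assumptions for illustration, not part of the ITIC survey; the $300,000 value is simply the threshold quoted above, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_pct: float) -> float:
    """Expected hours of downtime per year at a given availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_pct / 100.0)


def annual_downtime_cost(availability_pct: float, cost_per_hour: float) -> float:
    """Expected yearly downtime cost, assuming a flat cost per hour of downtime."""
    return annual_downtime_hours(availability_pct) * cost_per_hour


if __name__ == "__main__":
    # Availability levels are illustrative assumptions; $300,000 per hour is
    # the threshold that most survey respondents said they exceed.
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(pct)
        cost = annual_downtime_cost(pct, 300_000)
        print(f"{pct}% available: {hours:.2f} h/year down, about ${cost:,.0f}")
```

Under these assumptions, each additional "nine" of availability cuts the expected yearly downtime, and therefore the expected cost, by a factor of ten.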
Server RAS Defined
RAS in relation to servers is defined as follows:
Reliability – Increasing the mean time between hardware failures and ensuring data integrity. Data integrity is protected through error detection and correction or, if an error is not correctable, through error containment.
- Error Detection and Self-Healing
- Minimizes outage opportunities
- Correct results continuously
Availability – Refers to uninterrupted system and application operation even in the presence of uncorrectable errors
- Reduce frequency and duration of outages
- Self-diagnosing: work around faulty components or “self-heal”
- Never stops or slows down
Serviceability – Means a system can be maintained without disrupting operation. This capability requires both thoughtful platform design and innovative systems management.
- Avoid repeat failures with accurate diagnostics
- Concurrent repair on higher failure rate items
- Easy to repair and upgrade
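The three definitions above can be tied together numerically. The sketch below uses the widely used steady-state availability formula, availability = MTBF / (MTBF + MTTR), which is a standard reliability-engineering relation and not something this article specifies. The MTBF and MTTR values are invented for illustration, and the function name is hypothetical.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is up, given mean time between failures
    (MTBF) and mean time to repair (MTTR), both in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Illustrative numbers only (assumptions, not measured server data).
    baseline = steady_state_availability(mtbf_hours=10_000, mttr_hours=4)
    more_reliable = steady_state_availability(mtbf_hours=20_000, mttr_hours=4)
    more_serviceable = steady_state_availability(mtbf_hours=10_000, mttr_hours=2)

    print(f"baseline:             {baseline:.6f}")
    print(f"double MTBF:          {more_reliable:.6f}")
    print(f"halve MTTR:           {more_serviceable:.6f}")
```

In this simple model, reliability features raise MTBF and serviceability features lower MTTR, and either change raises availability.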
RAS Features of Lenovo ThinkSystem Servers with Intel Xeon Scalable Processors
The following tables list the key RAS features of the Intel Xeon Scalable Processors on Lenovo ThinkSystem servers.
Platinum and Gold level processors support both Advanced and Standard RAS features. Silver processors support only Standard RAS features.
Advanced RAS features | Category | 2nd Gen Intel Xeon Scalable Processors ("Purley") | 4-Socket 3rd Gen Intel Xeon Scalable Processors ("Cedar Island") | 2-Socket 3rd Gen Intel Xeon Scalable Processors ("Whitley") | Benefit |
---|---|---|---|---|---|
Viral Mode of error containment | Reliability | Yes | Yes | Yes | Enhanced error containment to improve data integrity, complementary to corrupt data containment mode |
Local Machine Check (LMCE) based Recovery | Reliability | Yes | Yes | Yes | Enhances MCA recovery-Execution path event, and increases the possibility of recovery |
SDDC +1, Adaptive DDDC (MR) +1 | Reliability | Yes | Yes | Yes | Adaptive virtual lockstep delivers up to two DRAM Device corrections. Also supports Single DRAM correction, as well as single bit correction post final DRAM device map out. |
PCI Express Live Error Recovery | Reliability | Yes | Yes | Yes | PCI-e root port error containment, and the opportunity to dynamically recover from the error |
Intel® UPI Dynamic Link width reduction | Availability | Yes | Yes | Yes | Enables interconnect to continue operation in presence of Interconnect link persistent failure |
Address range/Partial Memory Mirroring | Reliability | Yes | Yes | Yes | OS managed memory mirroring of selective ranges, increases data integrity at efficient cost |
MCA 2.0 Recovery (as per eMCA gen2 architecture) | Reliability | Yes | Yes | Yes | Firmware first model enables a reliable error sourcing capability with the ability to write to the MSR |

Standard RAS features | Category | 2nd Gen Intel Xeon Scalable Processors ("Purley") | 4-Socket 3rd Gen Intel Xeon Scalable Processors ("Cedar Island") | 2-Socket 3rd Gen Intel Xeon Scalable Processors ("Whitley") | Benefit |
---|---|---|---|---|---|
Advanced Error Detection and Correction (AEDC) | Reliability | Yes | Yes | Yes | Enhanced fault coverage within processor cores, and attempt to recover via instruction retry |
Error Detection and Correction | Reliability | Yes | Yes | Yes | Extensive Error detection and correction capability across the silicon, and the interconnects. |
Corrupt Data containment-Core | Reliability | Yes | Yes | Yes | Uncorrectable data explicitly marked and delivered synchronously to the consuming core to assist error containment and increase system reliability |
Corrupt Data containment-UnCore | Reliability | Yes | Yes | Yes | Uncorrectable data explicitly marked and delivered synchronously to the requestor, to assist error containment and increase system reliability |
SDDC, Adaptive Data Correction (SR) | Reliability | Yes | Yes | Yes | Adaptive virtual lockstep delivers single DRAM Device corrections, at bank granularity. Also supports Single DRAM correction. |
PCIe “Stop and Scream” | Reliability | Yes | Yes | Yes | PCI-e root port corrupt data containment feature, increases data integrity |
Memory Mirroring- Intra iMC | Reliability | Yes | Yes | Yes | Increase data integrity by creating a redundant/mirrored copy of data in system DRAM |
DDR4 memory RANK Sparing | Reliability | Yes | Yes | No | Reserved/spare DRAM RANKs are utilized to dynamically map out the failing DRAM RANK into the spare Ranks. |
Predictive Failure Analysis | Serviceability | Yes | Yes | Yes | Extensive error logs to assist software in predicting failures |
Failed DIMM Isolation | Serviceability | Yes | Yes | Yes | Extensive error logs to help software identify the failing DIMM |
Virtual (soft) Partitioning | Reliability | Yes | Yes | Yes | Virtual Machine Monitor ability to make use of hardware recovery, signaling, and error logs |
Error reporting via IOMCA | Serviceability | Yes | Yes | Yes | Unified error reporting of the IIO logic to the OS |
Error reporting through MCA 2.0 (eMCA gen2) | Serviceability | Yes | Yes | Yes | Firmware first model enables a reliable error sourcing capability |
Error reporting through eMCA gen1 | Serviceability | Yes | Yes | Yes | Firmware first model enables reliable error sourcing capability |
PCIe Card Hot Plug NVMe (Add, Remove, Swap) | Serviceability | Yes | Yes | Yes | Hot add and replace of NVMe drives |
PCI Express ECRC | Reliability | Yes | Yes | Yes | PCI Express End to end CRC checking, increasing system reliability |
PCIe Corrupt Data Containment (Data Poisoning) | Reliability | Yes | Yes | Yes | PCIe corrupt data mode of operation, synchronous signaling of the corrupted data along with data, increases system reliability |
PCIe Link CRC Error Check and Retry | Reliability | Yes | Yes | Yes | PCIe link CRC error check and retry, system reliability and recovery from transient errors |
PCIe Link Retraining and Recovery | Reliability | Yes | Yes | Yes | PCIe link retraining and attempted recovery from persistent link transient errors |
Mem SMBus hang recovery | Reliability | Yes | Yes | Yes | Software ability to reset memory SMBus interface to recover from hang condition |
DDR4 Command/ Address Parity Check and Retry | Reliability | Yes | Yes | Yes | DDR4 Address and command parity check and retry in the event of errors |
Time-out timer Schemes | Serviceability | Yes | Yes | Yes | Hierarchy of transaction time outs to assist system debug and reliable error sourcing. |
Intel® UPI Link Level Retry | Reliability | Yes | Yes | Yes | Intel UPI link’s ability to perform CRC check and retry on errors for higher degree of system reliability |
Intel® UPI Protocol Protection via 16 bit Rolling CRC | Reliability | Yes | Yes | Yes | Detection of transient data errors over Intel UPI interconnects, via 16bit CRC error checking |
Processor BIST | Serviceability | Yes | Yes | Yes | At power up, the processor’s built-in self-test engine tests the internal cache structure and provides the results to the system BIOS |
Socket disable for FRB | Availability | Yes | Yes | Yes | The capability to selectively disable a socket at boot time, allowing the system to power on in a failover configuration |
Core disable for FRB | Availability | Yes | Yes | Yes | The capability to disable failing cores at boot time, mapping out the failing core |
PIROM for System Information Storage | Serviceability | Yes | Yes | Yes | On package Processor Information ROM |
MCA Recovery-Execution path | Reliability | Yes | Yes | Yes | OS layer assisted recovery from uncorrectable data errors to prevent system reset |
MCA Recovery-Non execution path | Reliability | Yes | Yes | Yes | OS layer assisted recovery from uncorrectable data errors detected by Patrol scrubber or LLC Explicit Write Back |
DCU Scrubbing | Reliability | No | No | Yes | Improves system uptime by minimizing the impact due to high energy particle strike (Soft Errors) within core DCU (aka L1D cache) |
Partial Cache Line Sparing (PCLS) | Reliability | No | No | Yes | Extend system uptime in case a single bit memory persistent error is detected. It allows mapping out the cache line with failed single bit by using spare capacity available within the IMC |
PCIe Enhanced Downstream Port Containment (EDPC) | Reliability | No | No | Yes | EDPC is an enhancement to Downstream Port Containment (DPC) that adds containment of Root Port Programmable IO (RPPIO) errors. |
DCU/IFU Error Handling Enhancement | Reliability | No | No | Yes | Improving MCA Recovery (Execution path) coverage in case CPU core (DCU/IFU) receives multiple back-to-back read data with ‘Poison Error’ (aka Poison Storm). |
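Several rows above (Intel UPI Link Level Retry, PCIe Link CRC Error Check and Retry) describe the same general pattern: detect a transient error with a CRC, then retry the transfer. The following is a minimal software sketch of that pattern only. The real mechanisms run in hardware, and their CRC polynomials, frame formats, and retry protocols are not described in this article. The sketch assumes the CRC-CCITT polynomial provided by Python's `binascii.crc_hqx`, and `FlakyLink`, `frame_with_crc`, `crc_ok`, and `send_with_retry` are hypothetical names.

```python
import binascii


def frame_with_crc(payload: bytes) -> bytes:
    """Append a 16-bit CRC (CRC-CCITT, an assumed polynomial) to the payload."""
    crc = binascii.crc_hqx(payload, 0)
    return payload + crc.to_bytes(2, "big")


def crc_ok(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailing 2 bytes."""
    if len(frame) < 2:
        return False
    payload, received = frame[:-2], int.from_bytes(frame[-2:], "big")
    return binascii.crc_hqx(payload, 0) == received


class FlakyLink:
    """Hypothetical link that flips one bit in the first `bad_transfers` frames."""

    def __init__(self, bad_transfers: int) -> None:
        self.bad_transfers = bad_transfers
        self.transfers = 0

    def transmit(self, frame: bytes) -> bytes:
        self.transfers += 1
        if self.transfers <= self.bad_transfers:
            corrupted = bytearray(frame)
            corrupted[0] ^= 0x01  # simulate a transient single-bit error
            return bytes(corrupted)
        return frame


def send_with_retry(link: FlakyLink, payload: bytes, max_retries: int = 3) -> bytes:
    """Resend the frame until the receiver's CRC check passes or retries run out."""
    frame = frame_with_crc(payload)
    for _attempt in range(1 + max_retries):
        received = link.transmit(frame)
        if crc_ok(received):
            return received[:-2]  # strip the CRC and deliver the payload
    raise RuntimeError("persistent link error: retries exhausted")


if __name__ == "__main__":
    link = FlakyLink(bad_transfers=2)
    data = send_with_retry(link, b"cache line")
    print(data, "delivered after", link.transfers, "transfers")
```

When the error is transient, the retry succeeds and the consumer never sees corrupt data. When the error persists past the retry budget, the sketch raises an exception, which loosely corresponds to the point where the table's containment and recovery features would take over.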
Conclusion
Lenovo ThinkSystem servers equipped with the newest generations of Intel Xeon Scalable Processors have maintained leadership in RAS features. This translates into reliability, availability, and serviceability for all types of enterprise workloads, and helps enterprises avoid the cost of service downtime.
About the author
Randall Lundin is the Mission Critical Product Manager in the Lenovo Infrastructure Solutions Group. He is responsible for managing and planning Lenovo’s 4-socket and 8-socket servers. Randall has also authored and contributed to numerous Lenovo Press publications in the Mission Critical space.
This article is one in a series on the ThinkSystem SR850 V2 and SR860 V2 servers:
- Five Highlights of the Lenovo ThinkSystem SR850 V2
- Five Highlights of the Lenovo ThinkSystem SR860 V2
- Why Scale-Up With 4S and 8S Servers?
- Unique Intel Features Available with ThinkSystem SR850 V2 and SR860 V2
- ThinkSystem SR860 V2 is the New 4S Performance Leader
- The Value of Refreshing Your 4-Socket Servers with the ThinkSystem SR860 V2 and SR850 V2
- The Perfect 4-Socket and 8-Socket Servers for SAP HANA
- Total Cost of Ownership Comparison of Running SAP HANA on Lenovo ThinkSystem Servers
- RAS Features of the Intel Xeon Scalable Processors on Lenovo ThinkSystem Servers
Trademarks
Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. A current list of Lenovo trademarks is available on the Web at https://www.lenovo.com/us/en/legal/copytrade/.
The following terms are trademarks of Lenovo in the United States, other countries, or both:
Lenovo®
ThinkSystem
The following terms are trademarks of other companies:
Intel® and Xeon® are trademarks of Intel Corporation or its subsidiaries.
Other company, product, or service names may be trademarks or service marks of others.