RAS Features of the Lenovo ThinkSystem Intel Servers

Server reliability, availability, and serviceability (RAS) are crucial for modern enterprise IT shops that deliver mission-critical applications and services, because every hour of system downtime caused by an application delivery failure can be extremely costly. ThinkSystem servers with Intel Xeon Scalable processors continue to be at the top of the industry with regard to RAS features. This article explains the importance of RAS features on a server and lists the key RAS features on the latest ThinkSystem servers that Lenovo offers to customers looking to minimize downtime in their data center.

Changes in the July 31, 2025 update:

Changed the following in the Server RAS Defined section:
- New image for "Hourly Cost of Downtime"
- New image for "Unplanned Downtime by Server Hardware Platform in November 2024"
Added 6th Gen processors to the table of RAS features in the RAS features with Intel Xeon Scalable processors section

Introduction

Applications such as database, enterprise resource planning (ERP), customer resource management (CRM), and business intelligence (BI) applications need to be available 24 x 7 on a wide area or global basis. In addition, the likelihood of a failure increases statistically with the size of the servers, data, and memory required for these deployments.

While clustering and virtualization can help meet availability requirements, they are not adequate solutions for very large databases, BI, and high-end transactional systems. A failure affecting a single core business application can easily cost hundreds of thousands or even millions of dollars per hour. All this leads to a need for scalable and highly resilient servers that are well suited for critical business applications and large-scale consolidation.

Always On

Time is money. Even a few minutes of downtime can result in significant costs and cause internal business operations to come to a standstill. Downtime can also adversely impact a company’s relationship with its customers, business suppliers, and partners. A lack of reliability can damage a company’s reputation and result in lost business.

The growth of new applications has ratcheted database processing and business analytics to the top of the list for server workloads. These workloads demand continuous availability from the enterprise platforms on which they run.

"Always on" has become a global requirement and impacts many aspects of our lives:

Maximize productivity - Manufacturers need to keep their production line up and running. System downtime should not interrupt it.
Control access - Facility Security companies prevent external threats to organizations. Security application downtime shouldn't be an internal threat.
Protect profit - Retailers have sales targets to meet day in, day out. Transaction system downtime shouldn’t get in the way.
Protect lives - First Responders take care of emergencies 24 x 7 x 365. Application downtime shouldn’t be one of them.
Ensure quality care and privacy - Healthcare Institutions need to access patient information and be HIPAA compliant all the time. System downtime shouldn’t compromise either one.
Process transactions - Financial Services organizations manage thousands of transactions a second. Processing system downtime simply can’t happen.

Server RAS Defined

RAS in relation to servers is defined as follows:

Reliability – Increasing the mean time between hardware failures and ensuring data integrity. Data integrity is protected through error detection and correction or, if an error is not correctable, through error containment.

Error Detection and Self-Healing
Minimizes outage opportunities
Correct results continuously

Availability – Refers to uninterrupted system and application operation, even in the presence of uncorrectable errors.

Reduce frequency and duration of outages
Self-diagnosing: work around faulty components or “self-heal”
Never stops or slows down

Serviceability – Means a system can be maintained without disrupting operation. This capability requires both thoughtful platform design and innovative systems management.

Avoid repeat failures with accurate diagnostics
Concurrent repair on higher failure rate items
Easy to repair and upgrade
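
To make the "detect, correct, or contain" idea in the Reliability definition more concrete, the following is a minimal Python sketch of a single-error-correct, double-error-detect (SECDED) code over 4 data bits. It is a textbook extended Hamming code used purely for illustration and is not the ECC scheme implemented in Intel Xeon memory controllers; the function names `encode` and `decode` are hypothetical.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indexes 1, 2, and 4 hold the
    Hamming parity bits; indexes 3, 5, 6, and 7 hold the data bits.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of all 8 bits even
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data) where status is ok, corrected, or uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(code) % 2

    if syndrome != 0 and overall == 0:
        # Two bits flipped: detectable but not correctable, so the data
        # must be contained (flagged as bad) instead of being consumed.
        return "uncorrectable", None

    fixed = list(code)
    status = "ok"
    if overall == 1:
        # Exactly one bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        fixed[syndrome] ^= 1
        status = "corrected"
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, data
```

In this sketch a single flipped bit is corrected transparently, while a double flip is reported as uncorrectable so that the consumer can contain the bad data rather than use it.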

Industry Cost of Downtime

93% of companies surveyed say that one hour of downtime costs more than $300K.

Figure 1. Hourly Cost of Downtime (from ITIC 2024 Global Server Hardware, Server OS Reliability Survey)

Unplanned Downtime by Hardware Platform

According to ITIC, Lenovo ThinkSystem and IBM servers lead the industry with the least unplanned downtime:

Lenovo ThinkSystem: 315 milliseconds
Cisco UCS: 20 minutes
HPE Superdome: 1.35 minutes
Dell PowerEdge: 20 minutes
HPE ProLiant: 30 minutes

Figure 2. Unplanned Downtime by Server Hardware Platform in November 2024 (from ITIC 2024 Global Server Hardware/Server OS Reliability Survey)
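
As a rough way to relate these downtime figures to availability and cost, the following Python sketch converts a downtime figure into an availability percentage and a prorated cost. It assumes that each ITIC figure is measured per server per year, which the text above does not state, and it uses the $300K hourly figure only as a lower bound; the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability_percent(downtime_minutes_per_year: float) -> float:
    """Return availability (%) for a given amount of downtime per year."""
    return 100.0 * (1.0 - downtime_minutes_per_year / MINUTES_PER_YEAR)


def downtime_cost_usd(downtime_minutes: float,
                      hourly_cost_usd: float = 300_000.0) -> float:
    """Return the cost of an outage, prorated linearly from an hourly cost."""
    return hourly_cost_usd * downtime_minutes / 60.0


if __name__ == "__main__":
    # Downtime figures from the list above, converted to minutes.
    # ASSUMPTION: each figure is per server per year.
    platforms = {
        "Lenovo ThinkSystem": 315 / 1000 / 60,  # 315 milliseconds
        "HPE Superdome": 1.35,
        "Cisco UCS": 20.0,
        "Dell PowerEdge": 20.0,
        "HPE ProLiant": 30.0,
    }
    for name, minutes in platforms.items():
        print(f"{name}: {availability_percent(minutes):.7f}% available, "
              f"about ${downtime_cost_usd(minutes):,.0f} at $300K per hour")
```

For reference, under the same per-year assumption the commonly quoted "five nines" target (99.999% availability) corresponds to roughly 5.26 minutes of downtime per year.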

RAS features with Intel Xeon Scalable processors

The following table lists the key RAS features of the Intel Xeon Scalable processors in Lenovo ThinkSystem servers.

Table 1. RAS features of Lenovo ThinkSystem servers with Intel Xeon Scalable processors
Feature	Category	2nd Gen Intel (CLX)	3rd Gen Intel 4S (CPX)	3rd Gen Intel 2S (ICX)	4th Gen Intel (SPR)	Intel Xeon Max Series (SPR HBM)	5th Gen Intel (EMR)	Xeon® 6 (Sierra Forest /Granite Rapids)	Benefit
Advanced RAS features
Viral Mode of error containment	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Enhanced error containment to improve data integrity, complementary to corrupt data containment mode
Local Machine Check (LMCE) based Recovery	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Enhances MCA recovery-Execution path event, and increases the possibility of recovery
SDDC +1, Adaptive DDDC (ADDDC) (MR) +1*	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Adaptive virtual lockstep delivers up to two DRAM Device sub-region(s) sparing/correction at bank and/or full rank granularity. Also supports Single DRAM correction, as well as single bit correction post final DRAM device map out.
PCI Express Live Error Recovery	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	PCI-e root port error containment, and the opportunity to dynamically recover from the error
Intel® UPI Dynamic Link width reduction	Availability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Enables interconnect to continue operation in presence of Interconnect link persistent failure
Address range/Partial Memory Mirroring	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	OS managed memory mirroring of selective ranges, increases data integrity at efficient cost
MCA 2.0 Recovery (as per eMCA gen2 architecture)	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Firmware first model enables a reliable error sourcing capability with the ability to write to the MSR
Standard RAS features
Advanced Error Detection and Correction (AEDC)	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Enhanced fault coverage within processor cores, and attempt to recover via instruction retry. Also known as Advanced ECC or AECC.
Error Detection and Correction	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Extensive Error detection and correction capability across the silicon, and the interconnects.
Corrupt Data containment-Core	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Uncorrectable data explicitly marked and delivered synchronously to the consuming core to assist error containment and increase system reliability
Corrupt Data containment-UnCore	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Uncorrectable data explicitly marked and delivered synchronously to the requestor, to assist error containment and increase system reliability
SDDC, Adaptive Data Correction (ADC) (SR)*	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Adaptive virtual lockstep delivers single DRAM Device sub-region sparing/correction at bank granularity. Also supports Single DRAM correction.
PCIe “Stop and Scream”	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	PCI-e root port corrupt data containment feature, increases data integrity
Memory Mirroring- Intra iMC	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Increase data integrity by creating a redundant/mirrored copy of data in system DRAM
Rank/Multi Rank Sparing & DRAM sparing	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Reserved/spare ranks are utilized to dynamically map out the failing rank into the spare ranks. Starting with 3rd Gen “Ice Lake” processors, this is implemented using ADDDC/ ADC-SR/ ADDDC-MR to provide DRAM-level sparing feature support.
Predictive Failure Analysis	Serviceability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Extensive error logs to assist software in predicting failures
Failed DIMM Isolation	Serviceability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Extensive error logs to help software identify the failing DIMM
Virtual (soft) Partitioning	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Virtual Machine Monitor ability to make use of hardware recovery, signaling, and error logs
Error reporting via IOMCA	Serviceability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Unified error reporting of the IIO logic to the OS
Error reporting through MCA 2.0 (eMCA gen2)	Serviceability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Firmware first model enables a reliable error sourcing capability
Error reporting through eMCA gen1	Serviceability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Firmware first model enables reliable error sourcing capability
PCIe Card Hot Plug NVMe (Add, Remove, Swap)	Serviceability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Hot add and replace of NVMe drives
PCI Express ECRC	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	PCI Express End to end CRC checking, increasing system reliability
PCIe Corrupt Data Containment (Data Poisoning)	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	PCIe corrupt data mode of operation, synchronous signaling of the corrupted data along with data, increases system reliability
PCIe Link CRC Error Check and Retry	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	PCIe link CRC error check and retry, system reliability and recovery from transient errors
PCIe Link Retraining and Recovery	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	PCIe link retraining and attempted recovery from persistent link transient errors
Mem SMBus hang recovery	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Software ability to reset memory SMBus interface to recover from hang condition
DDR4 Command/ Address Parity Check and Retry	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	DDR4 Address and command parity check and retry in the event of errors
Time-out timer Schemes	Serviceability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Hierarchy of transaction time outs to assist system debug and reliable error sourcing.
Intel® UPI Link Level Retry	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Intel UPI link’s ability to perform CRC check and retry on errors for higher degree of system reliability
Intel® UPI Protocol Protection via 16 bit Rolling CRC	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Detection of transient data errors over Intel UPI interconnects, via 16bit CRC error checking
Processor BIST	Serviceability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	At power-up, the processor’s built-in self-test engine tests the internal cache structures and provides the results to the system BIOS
Socket disable for FRB	Availability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	The capability to selectively disable socket at the boot time, and therefore allowing system to power-on in a failover configuration
Core disable for FRB	Availability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	The capability to disable failing cores at boot time, map out the failing core
Memory disable for FRB	Availability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	The capability to disable failing DIMMs at boot time, map out the failing DIMMs
Memory demand and patrol Scrubbing	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	DRAM content is scrubbed with the corrected data by demand or patrol scrub operations. Scrubbing DRAM locations can prevent single-bit errors from accumulating into an uncorrected error.
DDR Power Up Post Package Repair (PPR)	Availability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	The capability will map out bad rows identified by the platform and replace them with spare rows in DDR DIMMs. The repair action will be executed during the memory training phase at system power-up.
PIROM for System Information Storage	Serviceability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	On package Processor Information ROM
MCA Recovery-Execution path	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	OS layer assisted recovery from uncorrectable data errors to prevent system reset
MCA Recovery-Non execution path	Reliability	Yes	Yes	Yes	Yes	Yes	Yes	Yes	OS layer assisted recovery from uncorrectable data errors detected by Patrol scrubber or LLC Explicit Write Back
DCU Scrubbing	Reliability	No	No	Yes	Yes	Yes	Yes	Yes	Improves system uptime by minimizing the impact due to high energy particle strike (Soft Errors) within core DCU (aka L1D cache)
Partial Cache Line Sparing (PCLS)	Reliability	No	No	Yes	No	Yes	No	No	Extend system uptime in case a single bit memory persistent error is detected. It allows mapping out the cache line with failed single bit by using spare capacity available within the IMC
PCIe Enhanced Downstream Port Containment (EDPC)	Reliability	No	No	Yes	Yes	Yes	Yes	Yes	EDPC is an enhancement to the Downstream Port Containment (DPC) thereby adding Root Port Programmable IO (RPPIO) errors.
DCU/IFU Error Handling Enhancement	Reliability	No	No	Yes	Yes	Yes	Yes	Yes	Improving MCA Recovery (Execution path) coverage in case CPU core (DCU/IFU) receives multiple back-to-back read data with ‘Poison Error’ (aka Poison Storm).
Memory Permanent Fault Detection (PFD)	Reliability	No	No	No	Yes	Yes	Yes	Yes	The memory controller uses an algorithm to detect and report permanent memory faults
DDR5 On-die Error Check and Scrub	Reliability	No	No	No	Yes	Yes	Yes	Yes	DDR5 has on-die ECC Error Check and Scrub (ECS) mode with an error counting scheme for transparency. The ECS mode allows the DRAM to internally read, correct single bit errors, and write back corrected data bits to the array (scrub errors) while providing transparency to error counts.
DDR5 PMIC Error Handling	Reliability	No	No	No	Yes	Yes	Yes	Yes	DDR5 DIMMs carry an on-DIMM Power Management IC (PMIC); this capability isolates a DIMM that reports a PMIC error.
Runtime Post Package Repair (PPR)	Reliability	No	No	No	No	No	No	Yes**	This capability replaces rows that the platform identifies as bad with spare rows in DDR DIMMs. The repair is executed at runtime using soft Post Package Repair.
CXL managed hot-plug (Add/Remove/Swap)	Reliability	No	No	No	No	No	No	Yes	Only software-managed hot add/remove is supported in CXL 2.0; hot plug is not yet supported for CXL Memory Module devices.
CXL eDPC	Reliability	No	No	No	No	No	No	Yes	CXL eDPC is an enhancement to Downstream Port Containment (DPC) that adds Root Port Programmable IO (RPPIO) errors. eDPC is not supported for CXL Memory Module devices, because it would otherwise trigger a memory uncorrectable error directly.
CXL link CRC error check and retry	Reliability	No	No	No	No	No	No	Yes	CXL link’s ability to perform CRC check and retry on errors for higher degree of system reliability
CXL link retraining and recovery	Reliability	No	No	No	No	No	No	Yes	CXL link retraining and attempted recovery from persistent link transient errors
CXL corrupt data containment (data poisoning)	Reliability	No	No	No	No	No	No	Yes	CXL corrupt data mode of operation, synchronous signaling of the corrupted data along with data, increases system reliability
CXL ECRC	Reliability	No	No	No	No	No	No	Yes	CXL End to end CRC checking, increasing system reliability
CXL Error reporting via IOMCA	Reliability	No	No	No	No	No	No	Yes	CXL unified error reporting of the IIO logic to the OS
CXL Memory Module RAS feature support	Reliability	No	No	No	No	No	No	Yes	Provide CXL Memory Module related error detection and isolation during POST, and also support memory correctable error and uncorrectable error handling during runtime.
RAS features for processors with High Bandwidth Memory (HBM)
HBM Only mode, 1LM, 2LM	Availability	N/A	N/A	N/A	N/A	Yes	N/A	N/A	High Bandwidth Memory (HBM) supports HBM Only mode (no DDR DIMMs in the system), 1LM mode (1-level memory: HBM acts the same as DDR), and 2LM mode (2-level memory: HBM acts as a cache for DDR).
HBM Permanent Fault Detection (PFD)	Reliability	N/A	N/A	N/A	N/A	Yes	N/A	N/A	The memory controller uses an algorithm to detect and report permanent HBM faults.
HBM Command/Address parity check and retry	Reliability	N/A	N/A	N/A	N/A	Yes	N/A	N/A	The memory controller retries the transaction and attempts to recover from parity errors on the CMD/ADDR bus.
HBM Power Up PPR	Availability	N/A	N/A	N/A	N/A	Yes	N/A	N/A	This capability replaces rows that the platform identifies as bad with spare rows in HBM memory. The repair is executed during the HBM training phase at system power-up.
HBM Demand and Patrol Scrub	Reliability	N/A	N/A	N/A	N/A	Yes	N/A	N/A	Similar to DDR, HBM DRAM content is scrubbed with the corrected data by demand or patrol scrub operations. Scrubbing DRAM locations can prevent single-bit errors from accumulating into an uncorrected error.
HBM Memory bank sparing	Reliability	N/A	N/A	N/A	N/A	Yes	N/A	N/A	Legacy Bank sparing operation, each pseudo channel of HBM supports spare bank for UEFI to make use of after it detects bank failing on the pseudo channel.
HBM Corrected Error Reporting	Reliability	N/A	N/A	N/A	N/A	Yes	N/A	N/A	Memory Corrected Error reporting supports per rank corrected error counters with leaky bucket algorithm.
HBM Disable/Map out for FRB	Availability	N/A	N/A	N/A	N/A	Yes	N/A	N/A	The capability to disable failing HBM at boot time, map out the failing HBM.

* See additional information about ADDDC below
** Runtime Post Package Repair (PPR) is supported by Granite Rapids but not by Sierra Forest.
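
The HBM Corrected Error Reporting row in the table mentions per-rank corrected error counters with a leaky bucket algorithm. The following Python sketch shows the general idea of a leaky bucket counter: each corrected error adds one to the count, the count drains over time, and only a sustained error rate reaches the threshold. The class name, the threshold, and the leak interval are hypothetical and are not taken from Intel or Lenovo firmware.

```python
class LeakyBucketCounter:
    """Corrected-error counter that drains by one every leak interval."""

    def __init__(self, threshold: int, leak_interval_s: float,
                 start_s: float = 0.0) -> None:
        self.threshold = threshold              # hypothetical reporting threshold
        self.leak_interval_s = leak_interval_s  # hypothetical drain period
        self.level = 0
        self._last_leak_s = start_s

    def _leak(self, now_s: float) -> None:
        """Drain one count for every full leak interval that has elapsed."""
        leaked = int((now_s - self._last_leak_s) // self.leak_interval_s)
        if leaked > 0:
            self.level = max(0, self.level - leaked)
            self._last_leak_s += leaked * self.leak_interval_s

    def record_error(self, now_s: float) -> bool:
        """Record one corrected error; return True if the threshold is reached."""
        self._leak(now_s)
        self.level += 1
        return self.level >= self.threshold


if __name__ == "__main__":
    burst = LeakyBucketCounter(threshold=3, leak_interval_s=10.0)
    print([burst.record_error(t) for t in (0.0, 1.0, 2.0)])

    sparse = LeakyBucketCounter(threshold=3, leak_interval_s=10.0)
    print([sparse.record_error(t) for t in (0.0, 20.0, 40.0)])
```

The design choice here is that occasional, widely spaced corrected errors drain away before they accumulate, so only a burst or a persistent fault trips the threshold.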

All ThinkSystem servers with Intel Xeon Scalable processors support the ADC (SR or Single Region) and ADDDC (MR or Multiple Region) features, which spare DRAM sub-regions at bank and/or full rank granularity. This is a finer granularity than that of the memory rank sparing feature.

Details of Adaptive Double DRAM Device Correction (ADDDC) from the Intel article "New Reliability, Availability, and Serviceability (RAS) Features in the Intel® Xeon® Processor Family" https://www.intel.com/content/www/us/en/developer/articles/technical/new-reliability-availability-and-serviceability-ras-features-in-the-intel-xeon-processor.html

Intel Xeon introduces an innovative approach in managing errors that the DDR4 DRAM DIMM may induce through the life of the product. ADDDC is deployed at runtime to dynamically map out the failing DRAM device and continue to provide SDDC ECC coverage on the DIMM, translating to longer DIMM longevity. The operation occurs at the fine granularity of DRAM Bank and/or Rank to have minimal impact on the overall system performance.

With the advent of ADDDC, the memory subsystem is always configured to operate in performance mode. When the number of corrections on a DRAM device reaches the targeted threshold value, with help from the UEFI runtime code, the identified failing DRAM region is adaptively placed in lockstep mode where the identified failing region of the DRAM device is mapped out of ECC. Once in ADDDC, cache line ECC continues to cover single DRAM (x4) error detection and apply a correction algorithm to the nibble.

Dependent on the processor SKU, each DDR4 channel supports one to two regions that can manage one or two faulty DRAMs, at Bank and/or full Rank granularity. The dynamic nature of the operation makes the performance implications of the lockstep operation on the system to be material only after the DRAM device is detected to be failing. The overall lockstep impact on system performance is now a function of the number of bad DRAM devices on the channel, with the worst-case scenario of two bad Ranks on every DDR4 channel.

The Silver/Bronze SKUs offer Adaptive Data Correction (ADC [SR]), at Bank granularity, and the Platinum/Gold SKUs offer Adaptive Double DRAM Device Correction (ADDDC [MR]), at Bank and Rank granularity, with additional hardware facilities for device map-out.
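
The threshold-driven behavior described above can be sketched as a small per-channel state tracker. This is only an illustrative model under stated assumptions, not Intel's or Lenovo's firmware logic: the class name, the region key, the threshold value, and the behavior when no region slot is left are all hypothetical. The limit of one or two regions per channel follows the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

# Hypothetical key for a DRAM sub-region: (rank, bank, device).
Region = Tuple[int, int, int]


@dataclass
class ChannelTracker:
    """Toy model of adaptive lockstep on one DDR channel."""

    threshold: int = 3    # hypothetical corrected-error threshold
    max_regions: int = 2  # one region (ADC SR) or two regions (ADDDC MR)
    counts: Dict[Region, int] = field(default_factory=dict)
    lockstep: Set[Region] = field(default_factory=set)

    def corrected_error(self, region: Region) -> str:
        """Record one corrected error and return the resulting state."""
        if region in self.lockstep:
            return "already-lockstep"
        self.counts[region] = self.counts.get(region, 0) + 1
        if self.counts[region] < self.threshold:
            # Below the threshold the channel stays in performance mode.
            return "performance-mode"
        if len(self.lockstep) < self.max_regions:
            # Threshold reached and a region slot is free: map the
            # failing region out by placing it in lockstep.
            self.lockstep.add(region)
            return "moved-to-lockstep"
        # ASSUMPTION: with no slot left, the fault is only reported.
        return "no-spare-region"


if __name__ == "__main__":
    channel = ChannelTracker(threshold=2, max_regions=1)
    failing = (0, 1, 5)
    for _ in range(3):
        print(channel.corrected_error(failing))
```

The model reflects the point made in the text that the lockstep cost is paid only for regions that have actually crossed the error threshold, while the rest of the channel stays in performance mode.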

About the authors

Randall Lundin is a Senior Product Manager in the Lenovo Infrastructure Solutions Group. He is responsible for planning and managing ThinkSystem servers. Randall has also authored and contributed to numerous Lenovo Press publications on ThinkSystem products.

Jason (Zhijun) Liu is a Principal Engineer and Senior UEFI Architect at Lenovo Infrastructure Solutions Group. Jason provides high-level infrastructure design support for Lenovo ThinkSystem UEFI firmware and leads the enabling, customization and innovation of new technologies into UEFI firmware. Jason also leads Reliability, Availability and Serviceability (RAS) architecture design and Secure feature design for ThinkSystem firmware.

Trademarks

Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. A current list of Lenovo trademarks is available on the Web at https://www.lenovo.com/us/en/legal/copytrade/.

The following terms are trademarks of Lenovo in the United States, other countries, or both:
Lenovo®
ThinkSystem®

The following terms are trademarks of other companies:

Intel®, the Intel logo and Xeon® are trademarks of Intel Corporation or its subsidiaries.

IBM® is a trademark of IBM in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.
