



# RAS Features of the Lenovo ThinkSystem Intel Servers

Applications such as database, enterprise resource planning (ERP), customer relationship management (CRM), and business intelligence (BI) applications need to be available 24/7 on a wide area or global basis. In addition, the likelihood of hardware failures increases statistically with the size of the servers, data, and memory required for these deployments.

While clustering and virtualization can help meet availability requirements, they are not adequate solutions for very large databases, BI, and high-end transactional systems. A failure affecting a single core business application can easily cost hundreds of thousands or even millions of dollars per hour. All this leads to a need for scalable and highly resilient servers that are well suited for critical business applications and large-scale consolidation.

# Always On

Time is money. Even a few minutes of downtime can result in significant costs and cause internal business operations to come to a standstill. Downtime can also adversely impact a company's relationship with its customers, business suppliers and partners. Reliability or lack thereof can potentially damage a company's reputation and result in lost business.

The growth of new applications has ratcheted database processing and business analytics to the top of the list for server workloads. These workloads demand continuous availability from the enterprise platforms on which they run.

"Always on" has become a global requirement and impacts many aspects of our lives:

- **Maximize productivity** Manufacturers need to keep their production line up and running. System downtime should not interrupt it.
- **Control access** Facility Security companies prevent external threats to organizations. Security application downtime shouldn't be an internal threat.
- **Protect profit** Retailers have sales targets to meet day in, day out. Transaction system downtime shouldn't get in the way.
- **Protect lives** First Responders take care of emergencies 24 x 7 x 365. Application downtime shouldn't be one of them.
- **Ensure quality care and privacy** Healthcare Institutions need to access patient information and be HIPAA compliant all the time. System downtime shouldn't compromise either one.
- **Process transactions** Financial Services organizations manage thousands of transactions a second. Processing system downtime simply can't happen.

# Server RAS Defined

RAS in relation to servers is defined as follows:

**Reliability** – Increasing the mean time between hardware failures and ensuring data integrity. Data integrity is protected through error detection and correction or, if an error is not correctable, through error containment.

- Error Detection and Self-Healing
- Minimizes outage opportunities
- Correct results continuously

**Availability** – Uninterrupted system and application operation, even in the presence of uncorrectable errors.

- Reduce frequency and duration of outages
- Self-diagnosing: work around faulty components or "self-heal"
- Never stops or slows down

**Serviceability** – Means a system can be maintained without disrupting operation. This capability requires both thoughtful platform design and innovative systems management.

- Avoid repeat failures with accurate diagnostics
- Concurrent repair on higher failure rate items
- Easy to repair and upgrade
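
The relationship between failure frequency, repair time, and availability can be made concrete with the standard steady-state availability formula, MTBF / (MTBF + MTTR). The sketch below is illustrative only: the function names and the sample MTBF and MTTR values are assumptions chosen for the example, not figures from Lenovo or Intel.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a repairable system is up: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Hypothetical values: one failure every 999 hours, one hour to repair.
availability = steady_state_availability(mtbf_hours=999, mttr_hours=1)
print(f"availability: {availability:.4%}")
print(f"downtime per year: {annual_downtime_minutes(availability):.1f} minutes")
```

The formula shows why both halves of RAS matter: reliability features raise MTBF, while serviceability features such as accurate diagnostics and concurrent repair lower MTTR.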

# **Industry Cost of Downtime**

90% of companies surveyed say that one hour of downtime costs more than \$300K.



Figure 1. Hourly Cost of Downtime (from ITIC 2022-2023 Global Server Hardware reliability & Server Security Survey Results)

# **Unplanned Downtime by Hardware Platform**

According to ITIC, Lenovo ThinkSystem and IBM lead the way with the least unplanned downtime:

- Lenovo ThinkSystem: 1.1 minutes
- Cisco UCS: 2 minutes
- Dell PowerEdge: 26 minutes
- HPE ProLiant: 39 minutes



Figure 2. Unplanned Downtime by Server Hardware Platform in January 2023 (from ITIC 2022-2023 Global Server Hardware reliability & Server Security Survey Results)
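
To put the two figures together, the following sketch converts the unplanned downtime minutes listed above into an approximate cost, using the \$300K per hour figure from Figure 1 as a floor. This is illustrative arithmetic rather than part of the ITIC survey: actual hourly cost varies by business, and the function name and constant are assumptions made for the example.

```python
# Lower-bound hourly cost of downtime taken from Figure 1 (assumed floor).
HOURLY_COST_USD = 300_000

# Unplanned downtime in minutes, as listed for Figure 2.
DOWNTIME_MINUTES = {
    "Lenovo ThinkSystem": 1.1,
    "Cisco UCS": 2,
    "Dell PowerEdge": 26,
    "HPE ProLiant": 39,
}


def downtime_cost_usd(minutes: float, hourly_cost_usd: float = HOURLY_COST_USD) -> float:
    """Convert minutes of downtime into cost at a flat hourly rate."""
    return minutes / 60 * hourly_cost_usd


for platform, minutes in DOWNTIME_MINUTES.items():
    print(f"{platform}: {minutes} min -> about ${downtime_cost_usd(minutes):,.0f}")
```

At this assumed rate, 1.1 minutes of downtime corresponds to roughly \$5,500, while 39 minutes corresponds to roughly \$195,000.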

# RAS features with Intel Xeon Scalable processors

The following table lists the key RAS features of the Intel Xeon Scalable processors in Lenovo ThinkSystem servers.

Table 1. RAS features of Lenovo ThinkSystem servers with Intel Xeon Scalable processors

| Feature | Category | 2nd<br>Gen<br>Intel<br>(CLX) | 3rd<br>Gen<br>Intel<br>4S<br>(CPX) | 3rd<br>Gen<br>Intel<br>2S<br>(ICX) | 4th<br>Gen<br>Intel<br>(SPR) | Intel<br>Xeon<br>Max<br>Series<br>(SPR<br>HBM) | 5th<br>Gen<br>Intel<br>(EMR) | Benefit |
|---|---|---|---|---|---|---|---|---|
| **Advanced RAS features** | | | | | | | | |
| Viral Mode of error containment | Reliability | Yes | Yes | Yes | Yes | Yes | Yes | Enhanced error containment to improve<br>data integrity, complementary to corrupt<br>data containment mode |
| Local Machine<br>Check (LMCE)<br>based Recovery | Reliability | Yes | Yes | Yes | Yes | Yes | Yes | Enhances MCA recovery-Execution path<br>event, and increases the possibility of<br>recovery |
| SDDC +1,<br>Adaptive DDDC<br>(ADDDC) (MR) +1*          | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Adaptive virtual lockstep delivers up to<br>two DRAM Device sub-region(s)<br>sparing/correction at bank and/or full<br>rank granularity. Also supports Single<br>DRAM correction, as well as single bit<br>correction post final DRAM device map<br>out.        |
| PCI Express Live<br>Error Recovery                     | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | PCI-e root port error containment, and<br>the opportunity to dynamically recover<br>from the error                                                                                                                                                              |
| Intel® UPI<br>Dynamic Link<br>width reduction          | Availability   | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Enables interconnect to continue<br>operation in presence of Interconnect<br>link persistent failure                                                                                                                                                            |
| Address<br>range/Partial<br>Memory Mirroring           | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | OS managed memory mirroring of<br>selective ranges, increases data<br>integrity at efficient cost                                                                                                                                                               |
| MCA 2.0 Recovery<br>(as per eMCA<br>gen2 architecture) | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Firmware first model enables a reliable<br>error sourcing capability with the ability<br>to write to the MSR                                                                                                                                                    |
| **Standard RAS features** | | | | | | | | |
| Advanced Error<br>Detection and<br>Correction (AEDC)   | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Enhanced fault coverage within processor cores, and attempt to recover via instruction retry                                                                                                                                                                    |
| Error Detection and Correction                         | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Extensive Error detection and correction<br>capability across the silicon, and the<br>interconnects.                                                                                                                                                            |
| Corrupt Data<br>containment-Core                       | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Uncorrectable data explicitly marked<br>and delivered synchronously to the<br>consuming core to assist error<br>containment and increase system<br>reliability                                                                                                  |
| Corrupt Data<br>containment-<br>UnCore                 | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Uncorrectable data explicitly marked<br>and delivered synchronously to the<br>requestor, to assist error containment<br>and increase system reliability                                                                                                         |
| SDDC, Adaptive<br>Data Correction<br>(ADC) (SR)*       | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Adaptive virtual lockstep delivers single<br>DRAM Device sub-region<br>sparing/correction at bank granularity.<br>Also supports Single DRAM correction.                                                                                                         |
| PCIe "Stop and<br>Scream"                              | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | PCI-e root port corrupt data containment feature, increases data integrity                                                                                                                                                                                      |
| Memory Mirroring-<br>Intra iMC                         | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Increase data integrity by creating a<br>redundant/mirrored copy of data in<br>system DRAM                                                                                                                                                                      |
| Rank/Multi Rank<br>Sparing & DRAM<br>sparing           | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Reserved/spare ranks are utilized to<br>dynamically map out the failing rank into<br>the spare ranks. Starting with 3rd Gen<br>"Ice Lake" processors, this is<br>implemented using ADDDC/ ADC-SR/<br>ADDDC-MR to provide DRAM-level<br>sparing feature support. |
| Predictive Failure<br>Analysis | Serviceability | Yes | Yes | Yes | Yes | Yes | Yes | Extensive error logs to assist software in<br>predicting failures |
| Failed DIMM<br>Isolation                                    | Serviceability | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Extensive error logs to help software identify the failing DIMM                                                                                           |
| Virtual (soft)<br>Partitioning | Reliability | Yes | Yes | Yes | Yes | Yes | Yes | Virtual Machine Monitor ability to make<br>use of hardware recovery, signaling, and<br>error logs |
| Error reporting via<br>IOMCA                                | Serviceability | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Unified error reporting of the IIO logic to the OS                                                                                                        |
| Error reporting<br>through MCA 2.0<br>(eMCA gen2)           | Serviceability | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Firmware first model enables a reliable error sourcing capability                                                                                         |
| Error reporting<br>through eMCA<br>gen1                     | Serviceability | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Firmware first model enables reliable error sourcing capability                                                                                           |
| PCIe Card Hot<br>Plug NVMe (Add,<br>Remove, Swap) | Serviceability | Yes | Yes | Yes | Yes | Yes | Yes | Hot add and replace of NVMe drives |
| PCI Express<br>ECRC                                         | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | PCI Express End to end CRC checking, increasing system reliability                                                                                        |
| PCIe Corrupt Data<br>Containment (Data<br>Poisoning)        | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | PCIe corrupt data mode of operation,<br>synchronous signaling of the corrupted<br>data along with data, increases system<br>reliability                   |
| PCIe Link CRC<br>Error Check and<br>Retry                   | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | PCIe link CRC error check and retry,<br>system reliability and recovery from<br>transient errors                                                          |
| PCIe Link<br>Retraining and<br>Recovery                     | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | PCIe link retraining and attempted<br>recovery from persistent link transient<br>errors                                                                   |
| Mem SMBus hang<br>recovery                                  | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Software ability to reset memory SMBus interface to recover from hang condition                                                                           |
| DDR4 Command/<br>Address Parity<br>Check and Retry          | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | DDR4 Address and command parity check and retry in the event of errors                                                                                    |
| Time-out timer<br>Schemes                                   | Serviceability | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Hierarchy of transaction time outs to<br>assist system debug and reliable error<br>sourcing.                                                              |
| Intel® UPI Link<br>Level Retry                              | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Intel UPI link's ability to perform CRC check and retry on errors for higher degree of system reliability                                                 |
| Intel® UPI Protocol<br>Protection via 16<br>bit Rolling CRC | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | Detection of transient data errors over<br>Intel UPI interconnects, via 16bit CRC<br>error checking                                                       |
| Processor BIST | Serviceability | Yes | Yes | Yes | Yes | Yes | Yes | At power up, the processor's built-in self-test<br>engine tests the internal<br>cache structure and provides the<br>results to the system BIOS |
| Socket disable for<br>FRB | Availability | Yes | Yes | Yes | Yes | Yes | Yes | The capability to selectively disable a<br>socket at boot time, allowing<br>the system to power on in a<br>failover configuration |
| Core disable for<br>FRB | Availability | Yes | Yes | Yes | Yes | Yes | Yes | The capability to disable failing cores at boot time, map out the failing core |
| Memory disable for<br>FRB                                 | Availability   | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | The capability to disable failing DIMMs at boot time, map out the failing DIMMs                                                                                                                                                                                                                                    |
| Memory demand<br>and patrol<br>Scrubbing | Reliability | Yes | Yes | Yes | Yes | Yes | Yes | DRAM content is scrubbed with the<br>corrected data by a demand or patrol<br>scrub operation. Scrubbing a DRAM<br>location can prevent single-bit errors<br>from accumulating into an<br>uncorrectable error. |
| DDR Power Up<br>Post Package<br>Repair (PPR) | Availability | Yes | Yes | Yes | Yes | Yes | Yes | This capability maps out rows that the<br>platform identifies as bad, replacing them<br>with spare rows in DDR DIMMs. The<br>repair action is executed during the<br>memory training phase at system power<br>up. |
| PIROM for System<br>Information<br>Storage                | Serviceability | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | On package Processor Information<br>ROM                                                                                                                                                                                                                                                                            |
| MCA Recovery-<br>Execution path                           | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | OS layer assisted recovery from<br>uncorrectable data errors to prevent<br>system reset                                                                                                                                                                                                                            |
| MCA Recovery-<br>Non execution path                       | Reliability    | Yes                          | Yes                                | Yes                                | Yes                          | Yes                                            | Yes                          | OS layer assisted recovery from<br>uncorrectable data errors detected by<br>Patrol scrubber or LLC Explicit Write<br>Back                                                                                                                                                                                          |
| DCU Scrubbing                                             | Reliability    | No                           | No                                 | Yes                                | Yes                          | Yes                                            | Yes                          | Improves system uptime by minimizing<br>the impact due to high energy particle<br>strike (Soft Errors) within core DCU (aka<br>L1D cache)                                                                                                                                                                          |
| Partial Cache Line<br>Sparing (PCLS)                      | Reliability    | No                           | No                                 | Yes                                | No                           | Yes                                            | No                           | Extend system uptime in case a single<br>bit memory persistent error is detected.<br>It allows mapping out the cache line with<br>failed single bit by using spare capacity<br>available within the IMC                                                                                                            |
| PCIe Enhanced<br>Downstream Port<br>Containment<br>(EDPC) | Reliability | No | No | Yes | Yes | Yes | Yes | EDPC extends Downstream Port<br>Containment (DPC) to cover<br>Root Port Programmed I/O<br>(RP PIO) errors. |
| DCU/IFU Error<br>Handling<br>Enhancement                  | Reliability    | No                           | No                                 | Yes                                | Yes                          | Yes                                            | Yes                          | Improving MCA Recovery (Execution<br>path) coverage in case CPU core<br>(DCU/IFU) receives multiple back-to-<br>back read data with 'Poison Error' (aka<br>Poison Storm).                                                                                                                                          |
| Memory Permanent<br>Fault Detection<br>(PFD) | Reliability | No | No | No | Yes | Yes | Yes | The memory controller uses an algorithm to<br>detect and report permanent memory<br>faults |
| DDR5 On-die Error<br>Check and Scrub                      | Reliability    | No                           | No                                 | No                                 | Yes                          | Yes                                            | Yes                          | DDR5 has on-die ECC Error Check and<br>Scrub (ECS) mode with an error<br>counting scheme for transparency. The<br>ECS mode allows the DRAM to<br>internally read, correct single bit errors,<br>and write back corrected data bits to the<br>array (scrub errors) while providing<br>transparency to error counts. |

| Feature                                             | Category     | 2nd<br>Gen<br>Intel<br>(CLX) | 3rd<br>Gen<br>Intel<br>4S<br>(CPX) | 3rd<br>Gen<br>Intel<br>2S<br>(ICX) | 4th<br>Gen<br>Intel<br>(SPR) | Intel<br>Xeon<br>Max<br>Series<br>(SPR<br>HBM) | 5th<br>Gen<br>Intel<br>(EMR) | Benefit |
|-----------------------------------------------------|--------------|------------------------------|------------------------------------|------------------------------------|------------------------------|------------------------------------------------|------------------------------|---------|
| DDR5 PMIC Error<br>Handling                         | Reliability  | No  | No  | No  | Yes | Yes | Yes | DDR5 DIMMs carry an on-DIMM Power Management IC (PMIC). This capability isolates a DIMM that reports a PMIC error. |
| RAS features for processors with High Bandwidth Memory (HBM) |     |     |     |     |     |     |     |     |
| HBM Only mode,<br>1LM, 2LM                          | Availability | N/A | N/A | N/A | N/A | Yes | N/A | High Bandwidth Memory (HBM) supports HBM Only mode (no DDR DIMMs in the system), 1LM mode (one-level memory: HBM acts the same as DDR), and 2LM mode (two-level memory: HBM acts as a cache for DDR). |
| HBM Permanent<br>Fault Detection<br>(PFD)           | Reliability  | N/A | N/A | N/A | N/A | Yes | N/A | The memory controller uses an algorithm to detect and report permanent HBM faults. |
| HBM<br>Command/Address<br>parity check and<br>retry | Reliability  | N/A | N/A | N/A | N/A | Yes | N/A | The memory controller retries the transaction to recover from CMD/ADDR parity errors on the CMD/ADDR bus. |
| HBM Power Up<br>PPR                                 | Availability | N/A | N/A | N/A | N/A | Yes | N/A | Maps out the rows that the platform identifies as bad by replacing them with spare rows in HBM memory. The repair is executed during the HBM training phase at system power up. |
| HBM Demand and<br>Patrol Scrub                      | Reliability  | N/A | N/A | N/A | N/A | Yes | N/A | As with DDR, HBM DRAM content is rewritten with corrected data by demand or patrol scrub operations. Scrubbing a DRAM location prevents single-bit errors from accumulating into an uncorrected error. |
| HBM Memory bank<br>sparing                          | Reliability  | N/A | N/A | N/A | N/A | Yes | N/A | Legacy bank sparing operation: each HBM pseudo channel provides a spare bank that UEFI can use after it detects a failing bank on that pseudo channel. |
| HBM Corrected<br>Error Reporting                    | Reliability  | N/A | N/A | N/A | N/A | Yes | N/A | Memory corrected error reporting supports per-rank corrected error counters with a leaky bucket algorithm. |
| HBM Disable/Map<br>out for FRB                      | Availability | N/A | N/A | N/A | N/A | Yes | N/A | Disables a failing HBM at boot time and maps it out. |

\* See additional information about ADDDC below
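
The table mentions per-rank corrected error counters that use a leaky bucket algorithm. The following Python sketch illustrates the general idea only: each corrected error adds one to a counter that also drains at a fixed rate, so a burst of errors reaches the threshold while widely spaced errors do not. The class name, function name, threshold, and leak interval are hypothetical values chosen for illustration and do not describe the actual firmware or hardware implementation.

```python
from dataclasses import dataclass


@dataclass
class LeakyBucket:
    """Corrected-error counter that drains at a fixed rate (hypothetical values)."""
    threshold: int          # level at which the rank is flagged
    leak_interval_s: float  # one count drains every leak_interval_s seconds
    level: int = 0
    last_leak_s: float = 0.0

    def _leak(self, now_s: float) -> None:
        # Drain one count per fully elapsed interval, never below zero.
        drained = int((now_s - self.last_leak_s) // self.leak_interval_s)
        if drained > 0:
            self.level = max(0, self.level - drained)
            self.last_leak_s += drained * self.leak_interval_s

    def record_error(self, now_s: float) -> bool:
        """Record one corrected error; return True if the threshold is reached."""
        self._leak(now_s)
        self.level += 1
        return self.level >= self.threshold


def first_flagged_rank(events, threshold, leak_interval_s):
    """Return the first rank whose bucket reaches the threshold, else None.

    events is an iterable of (timestamp_seconds, rank) pairs in time order.
    """
    buckets = {}
    for now_s, rank in events:
        if rank not in buckets:
            buckets[rank] = LeakyBucket(threshold, leak_interval_s, last_leak_s=now_s)
        if buckets[rank].record_error(now_s):
            return rank
    return None


# Three errors on rank 0 within a few seconds reach the threshold.
burst = [(0.0, 0), (1.0, 1), (2.0, 0), (3.0, 0)]
# Errors on rank 0 spaced 20 seconds apart drain away before they accumulate.
sparse = [(0.0, 0), (20.0, 0), (40.0, 0), (60.0, 0)]

print(first_flagged_rank(burst, threshold=3, leak_interval_s=10.0))   # 0
print(first_flagged_rank(sparse, threshold=3, leak_interval_s=10.0))  # None
```

The draining step is what separates a leaky bucket from a plain counter: occasional, unrelated corrected errors never add up to a report, while a sustained error rate does.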

All ThinkSystem servers with Intel Xeon Scalable processors support the ADC (SR, or Single Region) and ADDDC (MR, or Multiple Region) features. These features spare DRAM sub-regions at Bank and/or full Rank granularity, rather than sparing only whole ranks as the memory rank sparing feature does.

Details of Adaptive Double DRAM Device Correction (ADDDC) from the Intel article "New Reliability, Availability, and Serviceability (RAS) Features in the Intel® Xeon® Processor Family": https://www.intel.com/content/www/us/en/developer/articles/technical/new-reliability-availability-and-serviceability-ras-features-in-the-intel-xeon-processor.html

Intel Xeon introduces an innovative approach in managing errors that the DDR4 DRAM DIMM may induce through the life of the product. ADDDC is deployed at runtime to dynamically map out the failing DRAM device and continue to provide SDDC ECC coverage on the DIMM, translating to longer DIMM longevity. The operation occurs at the fine granularity of DRAM Bank and/or Rank to have minimal impact on the overall system performance.

With the advent of ADDDC, the memory subsystem is always configured to operate in performance mode. When the number of corrections on a DRAM device reaches the targeted threshold value, with help from the UEFI runtime code, the identified failing DRAM region is adaptively placed in lockstep mode where the identified failing region of the DRAM device is mapped out of ECC. Once in ADDDC, cache line ECC continues to cover single DRAM (x4) error detection and applies a correction algorithm to the nibble.

Depending on the processor SKU, each DDR4 channel supports one or two regions that can manage one or two faulty DRAMs, at Bank and/or full Rank granularity. Because the operation is dynamic, lockstep operation affects system performance only after a DRAM device is detected to be failing. The overall lockstep impact on system performance is therefore a function of the number of bad DRAM devices on the channel, with the worst-case scenario being two bad Ranks on every DDR4 channel.

The Silver/Bronze SKUs offer Adaptive Data Correction (ADC [SR]), at Bank granularity, and the Platinum/Gold SKUs offer Adaptive Double DRAM Device Correction (ADDDC [MR]), at Bank and Rank granularity, with additional hardware facilities for device map-out.
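
The adaptive behavior described above can be pictured with a small Python model. This is a sketch under stated assumptions, not Intel's or Lenovo's implementation: the `Channel` class, the action strings, and the threshold value are hypothetical, each failing region is tracked at Bank granularity only, and the escalation to full Rank granularity is left out. A channel with `max_regions=1` stands in for ADC (SR) and a channel with `max_regions=2` stands in for ADDDC (MR).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Toy model of one memory channel (all names and values are hypothetical)."""
    max_regions: int   # 1 models ADC (SR), 2 models ADDDC (MR)
    threshold: int     # corrected errors before a region is mapped out
    counts: Counter = field(default_factory=Counter)
    lockstep_regions: list = field(default_factory=list)

    def corrected_error(self, rank: int, bank: int, device: int) -> str:
        """Count one corrected error and return the action taken."""
        region = (rank, bank, device)
        if region in self.lockstep_regions:
            return "already-mapped-out"
        self.counts[region] += 1
        if self.counts[region] < self.threshold:
            return "counted"
        if len(self.lockstep_regions) < self.max_regions:
            # Only the failing region enters lockstep; the rest of the
            # channel keeps running in performance mode.
            self.lockstep_regions.append(region)
            return "mapped-out"
        return "no-spare-region"


def replay(channel, region, times):
    """Report `times` corrected errors for one (rank, bank, device) region."""
    return [channel.corrected_error(*region) for _ in range(times)]


single = Channel(max_regions=1, threshold=3)   # stands in for ADC (SR)
multi = Channel(max_regions=2, threshold=3)    # stands in for ADDDC (MR)

for channel in (single, multi):
    first = replay(channel, (0, 2, 5), 3)    # first failing device
    second = replay(channel, (0, 1, 7), 3)   # second failing device
    print(channel.max_regions, first[-1], second[-1])
# prints "1 mapped-out no-spare-region" and then "2 mapped-out mapped-out"
```

In this model nothing changes until a region crosses the threshold, which mirrors the point that the lockstep performance cost applies only after a DRAM device is detected to be failing.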

## **Further reading**

For further analysis of the reliability of Lenovo servers, see the latest ITIC report, available from:

ITIC 2022-2023 Global Server Hardware Reliability & Server Security Survey Results: https://lenovopress.lenovo.com/lp1117-itic-reliability-study

This article is one in a series on the ThinkSystem V3 servers:

- Five Highlights of the Lenovo ThinkSystem SR630 V3 Server
- Five Highlights of the Lenovo ThinkSystem SR650 V3 Server
- Five Highlights of the Lenovo ThinkSystem SR850 V3 Server
- Five Highlights of the Lenovo ThinkSystem SR860 V3 Server
- RAS Features of the Lenovo ThinkSystem Intel Servers

#### About the author

Randall Lundin is a Senior Product Manager in the Lenovo Infrastructure Solution Group. He is responsible for planning and managing ThinkSystem servers. Randall has also authored and contributed to numerous Lenovo Press publications on ThinkSystem products.

### Notices

Lenovo may not offer the products, services, or features discussed in this document in all countries. Consult your local Lenovo representative for information on the products and services currently available in your area. Any reference to a Lenovo product, program, or service is not intended to state or imply that only that Lenovo product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any Lenovo intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any other product, program, or service. Lenovo may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

Lenovo (United States), Inc., 8001 Development Drive, Morrisville, NC 27560, U.S.A. Attention: Lenovo Director of Licensing

LENOVO PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. Lenovo may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

The products described in this document are not intended for use in implantation or other life support applications where malfunction may result in injury or death to persons. The information contained in this document does not affect or change Lenovo product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of Lenovo or third parties. All information contained in this document was obtained in specific environments and is presented as an illustration. The result obtained in other operating environments may vary. Lenovo may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Any references in this publication to non-Lenovo Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this Lenovo product, and use of those Web sites is at your own risk. Any performance data contained herein was determined in a controlled environment. Therefore, the result obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

#### © Copyright Lenovo 2024. All rights reserved.

This document, LP1711, was created or updated on April 22, 2024.

Send us your comments in one of the following ways:

- Use the online Contact us review form found at: https://lenovopress.lenovo.com/LP1711
- Send your comments in an e-mail to: comments@lenovopress.com

This document is available online at https://lenovopress.lenovo.com/LP1711.

# Trademarks

Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. A current list of Lenovo trademarks is available on the Web at https://www.lenovo.com/us/en/legal/copytrade/.

The following terms are trademarks of Lenovo in the United States, other countries, or both:

- Lenovo®
- ThinkSystem®

The following terms are trademarks of other companies:

Intel® and Xeon® are trademarks of Intel Corporation or its subsidiaries.

Other company, product, or service names may be trademarks or service marks of others.