Author
Updated
2 Apr 2024
Form Number
LP1651
PDF size
15 pages, 280 KB
Abstract
The Lenovo EveryScale HPC & AI Software Stack combines open-source with proprietary best-of-breed supercomputing software to provide the most consumable open-source HPC software stack embraced by all Lenovo HPC customers.
This product guide provides essential pre-sales information to understand the key features and components of the EveryScale HPC & AI Software Stack. The product guide is intended for technical specialists, sales specialists, sales engineers, IT architects, and other IT professionals who want to learn more about the Lenovo EveryScale HPC & AI Software Stack.
Change History
Changes in the April 2, 2024 update:
- The LiCO Kubernetes K8S version part numbers have been withdrawn from marketing - Orchestration and management section
Introduction
The Lenovo EveryScale HPC & AI Software Stack combines open-source with proprietary best-of-breed supercomputing software to provide the most consumable open-source HPC software stack embraced by all Lenovo HPC customers.
It provides a fully tested and supported, complete yet customizable HPC software stack that enables administrators and users to make optimal and environmentally sustainable use of their Lenovo supercomputers.
The software stack is built on the most widely adopted and maintained HPC community software for orchestration and management. It integrates third-party components, especially around programming environments and performance optimization, to complement and enhance these capabilities, forming a unified software and services offering that adds value for our customers.
The software stack offers key software and support components for orchestration and management, programming environments and services and support, as shown in the following figure.
Did you know?
Lenovo EveryScale HPC & AI Software Stack is a modular software stack tailored to our customers' needs. Thoroughly tested, supported, and periodically updated, it combines the latest open-source HPC software releases to enable organizations with an agile and scalable IT infrastructure.
Benefits
The Lenovo EveryScale HPC & AI Software Stack provides the following benefits to customers.
Overcoming the Complexity of HPC Software
An HPC system software stack consists of dozens of components that administrators must integrate and validate before an organization’s HPC applications can run on top of the stack. Ensuring stable, reliable versions of all stack components is an enormous task due to the numerous interdependencies. This task is very time-consuming because of the constant release cycles and updates of individual components.
The Lenovo EveryScale HPC & AI Software Stack is fully tested, supported and periodically updated to combine the latest open-source HPC software releases, enabling organizations with an agile and scalable IT infrastructure.
Benefits of the Open-source Model
Going forward, in IDC's opinion, the development model exemplified by Linux is more workable. In this model, stack development is driven primarily by the open-source community and vendors offer supported distributions with additional capabilities for customers that require and are willing to pay for them. As the Linux initiative demonstrates, a community-based model like this has major advantages for enabling software to keep pace with requirements for HPC computing and storage hardware systems.
This model delivers new capabilities to users faster and makes HPC systems more productive, higher-returning investments.
A fair number of foundational open source HPC software components already exist (e.g., Open MPI, Rocky Linux, Slurm, OpenStack, and others). Many HPC community members are already taking advantage of these.
Customers will benefit from the HPC community, as the community works to integrate a multitude of components that are commonly used in HPC systems and are freely available for open source distribution.
The key open-source components of the software stack are:
- Confluent Management
Confluent is Lenovo-developed open-source software designed to discover, provision, and manage HPC clusters and the nodes that comprise them. Confluent provides powerful tooling to deploy and update software and firmware to multiple nodes simultaneously, with simple and readable modern software syntax.
- Slurm Orchestration
Slurm is integrated as an open-source, flexible, and modern workload manager. It manages complex workloads for faster processing and optimal utilization of the large-scale, specialized high-performance and AI resources that Lenovo systems provide for each workload. Lenovo provides support in partnership with SchedMD.
- LiCO Web portal
Lenovo Intelligent Computing Orchestration (LiCO) is a Lenovo-developed consolidated Graphical User Interface (GUI) for monitoring, managing and using cluster resources. The web portal provides workflows for both AI and HPC, and supports multiple AI frameworks, including TensorFlow, Caffe, Neon, and MXNet, allowing you to leverage a single cluster for diverse workload requirements.
- Energy Aware Runtime
EAR is a powerful European open-source energy management suite that supports everything from monitoring and power capping to live optimization during application runtime. Lenovo is collaborating with Barcelona Supercomputing Centre (BSC) and EAS4DC on its continuous development and support, and offers three versions with differentiated capabilities.
Software components
Components are covered in the following sections:
Orchestration and management
The following orchestration software is available with Lenovo EveryScale HPC & AI Software Stack:
- Confluent (Best Recipe interoperability)
Confluent is Lenovo-developed open source software designed to discover, provision, and manage HPC clusters and the nodes that comprise them. Our Confluent management system and LiCO Web portal provide an interface designed to abstract the users from the complexity of HPC cluster orchestration and AI workloads management, making open-source HPC software consumable for every customer. Confluent provides powerful tooling to deploy and update software and firmware to multiple nodes simultaneously, with simple and readable modern software syntax. Additionally, Confluent’s performance scales seamlessly from small workstation clusters to thousand-plus node supercomputers. For more information, see the Confluent documentation.
- Lenovo Intelligent Computing Orchestration (Best Recipe interoperability)
Lenovo Intelligent Computing Orchestration (LiCO) is a Lenovo-developed software solution that simplifies the management and use of distributed clusters for High Performance Computing (HPC) and Artificial Intelligence (AI) environments. LiCO provides a consolidated Graphical User Interface (GUI) for monitoring and usage of cluster resources, allowing you to easily run both HPC and AI workloads across a choice of Lenovo infrastructure, including both CPU and GPU solutions to suit varying application requirements.
LiCO Web portal provides workflows for both AI and HPC, and supports multiple AI frameworks, including TensorFlow, Caffe, Neon, and MXNet, allowing you to leverage a single cluster for diverse workload requirements. For more information, see the LiCO product guide.
- LiCO customization service
Lenovo Intelligent Computing Orchestration (LiCO) customization services enable customers to request customized features tailored for their own needs. The service is evaluated and quoted in the form of man-days, based on the actual order list.
HPC solution sellers need to provide pre-sales support, collaborate with HPC architects, and communicate with LiCO's R&D team (Ding Hong dinghong1@lenovo.com) for workload evaluations. A quote is provided by Lenovo based on the output SOW and analysis of the workload. After the sellers place an order, they will email the R&D team to request implementation. The LiCO R&D team will deliver the work based on the order content and SOW.
- Slurm
Slurm is a modern, open-source scheduler designed specifically to satisfy the demanding needs of high-performance computing (HPC), high-throughput computing (HTC), and AI. Slurm is developed and maintained by SchedMD® and integrated within LiCO. Slurm maximizes workload throughput, scale, and reliability, and delivers results in the fastest possible time while optimizing resource utilization and meeting organizational priorities. Slurm automates job scheduling to help administrators and users manage the complexities of on-premises, hybrid, or cloud workspaces. The Slurm workload manager is designed for fast, reliable execution, which increases productivity while decreasing costs. Slurm’s modern, plug-in-based architecture, together with its RESTful API, supports both large and small HPC, HTC, and AI environments. Allow your teams to focus on their work while Slurm manages their workloads.
- NVIDIA Unified Fabric Manager (UFM) (ISV supported)
NVIDIA Unified Fabric Manager (UFM) is InfiniBand networking management software that combines enhanced, real-time network telemetry with fabric visibility and control to support scale-out InfiniBand data centers. For more information, see the NVIDIA UFM product page.
The two UFM offerings available from Lenovo are as follows:
- UFM Telemetry for Real-Time Monitoring
The UFM Telemetry platform provides network validation tools to monitor network performance and conditions, capturing and streaming rich real-time network telemetry information, application workload usage, and system configuration to an on-premises or cloud-based database for further analysis.
- UFM Enterprise for Fabric Visibility and Control
The UFM Enterprise platform combines the benefits of UFM Telemetry with enhanced network monitoring and management. It performs automated network discovery and provisioning, traffic monitoring, and congestion discovery. It also enables job schedule provisioning and integrates with industry-leading job schedulers and cloud and cluster managers, including Slurm and Platform Load Sharing Facility (LSF).
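As a rough illustration of the job scheduling that Slurm automates, the following Python sketch renders a batch script from a simple job description. The `JobRequest` class, its field names, and the example commands are hypothetical and not part of the Lenovo stack or LiCO; the `#SBATCH` directives shown (`--job-name`, `--nodes`, `--ntasks-per-node`, `--time`, `--gres`) are standard Slurm options, but site-specific settings such as partitions and accounts are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class JobRequest:
    """Hypothetical job description; the field names are illustrative only."""
    name: str
    nodes: int
    tasks_per_node: int
    time_limit: str          # Slurm time format, for example "01:30:00"
    command: str             # command launched with srun
    gpus_per_node: int = 0   # 0 means no GPU resources are requested


def render_batch_script(job: JobRequest) -> str:
    """Render a batch script that uses standard #SBATCH directives."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --nodes={job.nodes}",
        f"#SBATCH --ntasks-per-node={job.tasks_per_node}",
        f"#SBATCH --time={job.time_limit}",
    ]
    # Request GPUs as a generic resource only when the job needs them.
    if job.gpus_per_node > 0:
        lines.append(f"#SBATCH --gres=gpu:{job.gpus_per_node}")
    lines.append("")
    lines.append(f"srun {job.command}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed example workload: a two-node job with four GPUs per node.
    job = JobRequest("train", 2, 4, "01:30:00", "python train.py", gpus_per_node=4)
    print(render_batch_script(job))
```

The rendered text would typically be saved to a file and submitted with `sbatch`; a web portal such as LiCO hides this step behind its job templates.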
The following table lists all Orchestration software available with Lenovo EveryScale HPC & AI Software Stack.
Programming environment
The following programming software is available with Lenovo EveryScale HPC & AI Software Stack.
- NVIDIA CUDA
NVIDIA CUDA is a parallel computing platform and programming model for general computing on graphics processing units (GPUs). With CUDA, developers are able to dramatically speed up computing applications by harnessing the power of GPUs. When using CUDA, developers program in popular languages such as C, C++, Fortran, Python, and MATLAB, and express parallelism through extensions in the form of a few basic keywords. For more information, see the NVIDIA CUDA Zone.
- NVIDIA HPC Software Development Kit
The NVIDIA HPC SDK C, C++, and Fortran compilers support GPU acceleration of HPC modeling and simulation applications with standard C++ and Fortran, OpenACC directives, and CUDA. GPU-accelerated math libraries maximize performance on common HPC algorithms, and optimized communications libraries enable standards-based multi-GPU and scalable systems programming. Performance profiling and debugging tools simplify porting and optimization of HPC applications, and containerization tools enable easy deployment on-premises or in the cloud. For more information, see the NVIDIA HPC SDK.
The following table lists the relevant ordering part numbers.
Support components
The following software support is available with Lenovo EveryScale HPC & AI Software Stack.
- SchedMD Slurm Support for Lenovo HPC Systems
Slurm is part of the Lenovo EveryScale HPC & AI Software Stack, integrated as an open-source, flexible, and modern workload manager. It manages complex workloads for faster processing and optimal utilization of the large-scale, specialized high-performance and AI resources that Lenovo systems provide for each workload.
SchedMD Slurm Support service capabilities for Lenovo HPC systems include:
- Level 3 Support: High-performance systems must run at high utilization and performance to meet end users' and management's return-on-investment expectations. Customers covered by a support contract can reach out to SchedMD engineering experts to promptly resolve complex workload management issues and quickly receive answers to complex configuration questions, instead of taking weeks or even months to try to resolve them in-house.
- Remote Consulting: Valuable assistance and implementation expertise that speeds custom configuration tuning to increase throughput and utilization efficiency on complex and large-scale systems. Customers can review cluster requirements, operating environment, and organizational goals directly with a Slurm engineer to optimize the configuration and meet organizational needs.
- Tailored Slurm Training: Expert training that empowers users to harness Slurm capabilities to speed projects and increase technology adoption. A customer scoping call before the onsite instruction ensures coverage of specific use cases that address the organization's needs. In-depth, comprehensive technical training is delivered in a hands-on lab workshop format to help users apply Slurm best practices to their site-specific use cases and configuration.
- EAS Service and Support for EAR
The Energy Aware Runtime is open source under the BSD-3 and EPL-1.0 licenses. For professional use cases in production environments, installation (remote) and support services are available. Commercial support and implementation services for EAR can be purchased from Lenovo under the EveryScale HPC & AI Software Stack CTO and are delivered through Energy Aware Solutions (EAS). There are three different distributions of EAR: Detective Pro, Optimizer, and Optimizer Pro. Detective Pro provides the basic monitoring and accounting capabilities, Optimizer adds energy optimization, and Optimizer Pro adds power capping.
- For clusters with more than 500 nodes – call for a quote.
- EAS does not propose installation/training without any support contract.
- Installation/training and support prices depend on whether the cluster has only CPUs (Intel or AMD) or CPUs + GPUs (Intel or AMD CPUs and NVIDIA GPUs).
- Because the goal of Optimizer Pro and Optimizer is to control and/or reduce the power consumption of the system, their yearly support is priced according to the power consumption of the cluster and the associated Support Entitlement per System Power Rating (SPR). For the same configuration, the cluster power consumption is identical for Optimizer Pro and Optimizer, but their Support Entitlement per SPR differs.
- Detective Pro is the only service proposed for clusters with a Total System Power of less than 127 kW and an SPR value of less than 10.
- A spreadsheet is provided to compute the SPR of a cluster and its associated service prices for Detective Pro, Optimizer Pro, and Optimizer. For the latest spreadsheet and additional information, please reach out to the product manager.
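The eligibility rules in this list can be summarized in a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the SPR value is taken as an input (it comes from the spreadsheet and is not computed here), both thresholds are assumed to apply together, and clusters outside that range are assumed to be eligible for all three distributions. Pricing is not modeled.

```python
from typing import List

# Thresholds quoted in this guide.
DETECTIVE_ONLY_POWER_KW = 127.0  # Total System Power threshold, in kW
DETECTIVE_ONLY_SPR = 10.0        # System Power Rating (SPR) threshold
QUOTE_NODE_LIMIT = 500           # above this node count, call for a quote


def proposed_ear_distributions(total_power_kw: float, spr: float) -> List[str]:
    """Return the EAR distributions that may be proposed for a cluster.

    Assumption: both conditions (power and SPR below their thresholds)
    must hold for the Detective Pro-only restriction to apply.
    """
    if total_power_kw < DETECTIVE_ONLY_POWER_KW and spr < DETECTIVE_ONLY_SPR:
        return ["Detective Pro"]
    return ["Detective Pro", "Optimizer", "Optimizer Pro"]


def needs_custom_quote(node_count: int) -> bool:
    """Clusters with more than 500 nodes are quoted individually."""
    return node_count > QUOTE_NODE_LIMIT


if __name__ == "__main__":
    # Hypothetical cluster values, for illustration only.
    print(proposed_ear_distributions(total_power_kw=90.0, spr=6.0))
    print(needs_custom_quote(640))
```

For actual quotes, the SPR and prices should always come from the latest spreadsheet provided by the product manager.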
- Intel oneAPI
The Intel oneAPI Base & HPC Toolkit is a comprehensive software development suite designed to empower developers in creating high-performance computing (HPC) solutions that exploit the full potential of modern hardware architectures. This toolkit encompasses an array of advanced tools, libraries, and compilers, enabling programmers to efficiently design, optimize, and deploy parallel applications across diverse computing platforms, including CPUs, GPUs, and FPGAs. With a focus on fostering code portability and performance scalability, the Intel oneAPI Base & HPC Toolkit equips developers with the means to enhance productivity, streamline software development, and achieve exceptional performance outcomes in the realm of high-performance computing.
For more information, see Intel® oneAPI Base and HPC Toolkit.
- Target platforms for development and deployment can range from a small system to a large multi-node cluster requiring different support efforts.
- Support Renewals are available.
- For Developer-based parts, if the team has more than 50 developers (SBF1 and SBEG), call for a quote.
- Special pricing for academic research is available.
- Commercial parts have different part numbers depending on whether they are quoted with or without Intel hardware.
The following table lists the relevant ordering part numbers (some of the product numbers are not yet released at the time of writing this product guide).
* SchedMD Slurm Onsite or Remote 3-day Training: in-depth and comprehensive site-specific technical training. Can only be added to a support purchase.
** SchedMD Slurm Consulting w/Sr.Engineer 2REMOTE Sessions (Up to 8 hrs): review initial Slurm setup, in-depth technical chats around specific Slurm topics & review site config for optimization & best practices. Required with support purchase, cannot be purchased separately.
Note: SchedMD Slurm Consulting w/Sr.Engineer 2REMOTE Sessions option must be selected and locked in for every SchedMD support selection.
SchedMD Slurm Onsite or Remote 3-day Training option must be selected and locked in for every SchedMD Commercial support selection. Optional for EDU & Government support selections.
The following table lists all Intel oneAPI software available with Lenovo EveryScale HPC & AI Software Stack.
Seller training courses
The following sales training courses are offered for employees and partners (login required). Courses are listed in date order.
- VTT HPC: LiCO-Computing Orchestration for AI and HPC
2024-07-30 | 92 minutes | Employees Only
Published: 2024-07-30
Please view this session as Ana Irimiea, AI Systems and Solutions Product Manager at ISG, speaks with us about LiCO version 7.2.1.
She will talk about:
• Overview of LiCO
• Administrator and user capabilities
• Deployment options
• Ordering LiCO
• Roadmap
Length: 92 minutes
Start the training:
Employee link: Grow@Lenovo
- Enterprise Deployment of AI and Phases of Model Development
2024-05-23 | 12 minutes | Employees and Partners
Lenovo Senior AI Data Scientist Dr David Ellison whiteboards the concepts of using data from multiple sources to derive customer benefits through Artificial Intelligence and LiCO (Lenovo Intelligent Computing Orchestration) software.
Published: 2024-05-23
By the end of this training, you should be able to:
• Describe enterprise deployment of Artificial Intelligence
• Explain the process of model development
• State the purpose of LiCO (Lenovo Intelligent Computing Orchestration) software
• List three examples of Artificial Intelligence solutions
Length: 12 minutes
Start the training:
Employee link: Grow@Lenovo
Partner link: Lenovo Partner Learning
- Selling Lenovo Intelligent Computing Orchestration
2021-08-25 | 18 minutes | Employees and Partners
The goal of this course is to help ISG and Business Partner sellers understand Lenovo Intelligent Computing Orchestration (LiCO) software. Learn when and how to propose LiCO in order to continue the conversation with the customer and make a sale.
Published: 2021-08-25
Length: 18 minutes
Start the training:
Employee link: Grow@Lenovo
Partner link: Lenovo Partner Learning
Resources
For more information, see these resources:
- LiCO Product Guide:
https://lenovopress.lenovo.com/lp0858-lenovo-intelligent-computing-orchestration-lico#product-families
- LiCO website:
https://www.lenovo.com/us/en/data-center/software/lico/
- Lenovo DCSC configurator:
https://dcsc.lenovo.com
- Optimizing Power and Energy in HPC data centers with Energy Aware Runtime:
https://lenovopress.lenovo.com/lp1646
- Energy Aware Runtime software and documentation:
http://www.eas4dc.com
- Lenovo Confluent documentation:
https://hpc.lenovo.com/users/documentation/
- Lenovo Compute Orchestration in HPC Data Centers with Slurm - Solution Brief:
https://lenovopress.lenovo.com/lp1701-lenovo-compute-orchestration-in-hpc-data-centers-with-slurm
Trademarks
Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. A current list of Lenovo trademarks is available on the Web at https://www.lenovo.com/us/en/legal/copytrade/.
The following terms are trademarks of Lenovo in the United States, other countries, or both:
Lenovo®
The following terms are trademarks of other companies:
AMD is a trademark of Advanced Micro Devices, Inc.
Intel® is a trademark of Intel Corporation or its subsidiaries.
Linux® is the trademark of Linus Torvalds in the U.S. and other countries.
Other company, product, or service names may be trademarks or service marks of others.
Full Change History
Changes in the April 2, 2024 update:
- The LiCO Kubernetes K8S version part numbers have been withdrawn from marketing - Orchestration and management section
Changes in the February 21, 2024 update:
- The following have been reactivated under - Software components section
- Lenovo Confluent 1 Year Support per managed node, 7S090039WW
- Lenovo Confluent 3 Year Support per managed node, 7S09003AWW
- Lenovo Confluent 5 Year Support per managed node, 7S09003BWW
- Lenovo Confluent 1 Extension Year Support per managed node, 7S09003CWW
Changes in the February 20, 2024 update:
- Updates to "Intel oneAPI options" under - Support components section
- Added the following new - Seller training courses section
Changes in the December 5, 2023 update:
- Withdrawn the following "Lenovo Confluent support" under - Software components section
- Lenovo Confluent 1 Year Support per managed node, 7S090039WW
- Lenovo Confluent 3 Year Support per managed node, 7S09003AWW
- Lenovo Confluent 5 Year Support per managed node, 7S09003BWW
- Lenovo Confluent 1 Extension Year Support per managed node, 7S09003CWW
Changes in the October 11, 2023 update:
- Updates to the Software components section:
- Part number updates - NVIDIA UFM Telemetry
Changes in the September 19, 2023 update:
- Updates to the Support Components section:
- Added additional details to - EAS Service and Support for EAR - feature
- Added additional options to - Intel oneAPI options - table
Changes in the August 14, 2023 update:
- Added new label to the title and multiple components - EveryScale - to align with new rebranding
- Updates to Software components section:
- Intel oneAPI Base & HPC Toolkit (Multi-Node) Support
- LiCO customization service
Changes in the January 9, 2023 update:
- Added new SKUs in the Software components section:
- Lenovo Confluent support
- NVIDIA CUDA
- NVIDIA HPC SDK
- EAS Service and Support for EAR
First published: November 10, 2022