Lenovo GOAST Bioinformatics Solution

Solution Brief

Author: Dana Alegre
Updated: 1 May 2024
Form Number: LP1888
PDF size: 7 pages, 213 KB

Abstract

Bioinformatics research and genomics data analyses depend on access to High-Performance Computing (HPC) resources to solve problems ranging from modern plant breeding to precision medicine. As the scope of omics grows, including the need for real-time results, data analysis becomes a major bottleneck.

Lenovo’s Genomics Optimization and Scalability Tool (GOAST) improves upon the Intel Select Solution for Genomics to optimize performance on Lenovo hardware. The GOAST solution accelerates research by enabling life scientists to process more samples every day, thus decreasing time to discovery. With industry-leading technology experts, we take a customer-centric approach to provide the omics and bioinformatics solutions that best meet your needs.

This solution brief covers the key features and capabilities of GOAST Version 4.0.

Change History

Changes in the May 1, 2024 update:

  • Updated the Intel configuration to include 5th Gen Intel Xeon processors - Table 1
  • Updated the GOAST results for the Intel configuration - Table 2

A Genomics and Bioinformatics Optimized System

Bioinformatics is the computational analysis of biological data, powering research from basic biology to medicine, drug discovery, agriculture, and more. The last two decades have seen an explosion of bioinformatics data, enabled by advances in computational resources. Yet even at cluster and supercomputer speeds, large-scale bioinformatics still faces long execution times on massive volumes of data, which delays “time to answer”.

The only high-performance solutions to mitigate these challenges are single-purpose boutique solutions requiring expensive specialty hardware and substantial licensing fees. These single-purpose genomics solutions require organizations to purchase multiple architectures to support other types of research in their datacenter. Even those organizations working in a single subfield (e.g. in genomics) find that the single-purpose boutique solutions are not enough.

For example, those performing secondary genomics analytics find that they also need computational resources to gather, select, manage, transform, and describe their data, in both primary and tertiary downstream analyses. General-purpose data centers worldwide feel this need even more acutely, since omics users (genomics, transcriptomics, proteomics) are only a fraction of the users they must serve. Lenovo therefore developed the Genomics Optimization and Scalability Tool (GOAST), a genomics- and bioinformatics-optimized platform.

Extremely Fast Bioinformatics Analytics

Lenovo Genomics Optimization and Scalability Tool (GOAST) is a multi-purpose system specifically engineered to meet the demands of bioinformatics workloads. GOAST leverages an architecture of carefully selected hardware tuned to accelerate bioinformatics performance. Lenovo GOAST’s high-core, fast I/O, and high-memory specs (Table 1) excel at running the massively parallel applications and sequential workflows common in Bioinformatics, including multi-omics (genomics, transcriptomics, proteomics) applications. GOAST accelerates mapping and whole-genome sequencing (WGS) variant calling analytics from days to minutes. This process typically takes 40 hours in many data centers, but runs in just ~23.5 minutes in GOAST systems.
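The speedup implied by the two run times quoted above can be worked out directly. The short Python sketch below is illustrative only: the function name is hypothetical, and the inputs are simply the 40-hour and ~23.5-minute figures from the text.

```python
def speedup(baseline_hours: float, optimized_minutes: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    if baseline_hours <= 0 or optimized_minutes <= 0:
        raise ValueError("run times must be positive")
    # Convert the baseline to minutes so both values share a unit.
    return (baseline_hours * 60.0) / optimized_minutes


if __name__ == "__main__":
    # Figures quoted in the text: 40 hours versus ~23.5 minutes.
    ratio = speedup(baseline_hours=40, optimized_minutes=23.5)
    print(f"Approximate speedup: {ratio:.0f}x")
```

With these inputs the ratio comes out at roughly 100x. Actual results depend on the hardware setup and sample coverage.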

Table 1. GOAST v4.0 reference architectures for bioinformatics (fully customizable)

Component   GOAST Intel Base                          GOAST AMD Base
Processor   2x Intel 8592+ CPUs (64 cores, 1.9 GHz)   2x AMD 9654 CPUs (96 cores, 2.4 GHz)
Memory      1 TB RAM, 16x 64 GB 5600 MT/s RDIMMs      1.5 TB RAM, 24x 64 GB 4800 MT/s DIMMs
Storage     Minimum 7 TB SATA or NVMe SSD             Minimum 7 TB SATA or NVMe SSD

Key features of Lenovo GOAST

The key features of Genomics Optimization and Scalability Tool (GOAST) include the following:

  • Extremely fast analytics

    Bioinformatics-optimized hardware runs sequential workflows faster: e.g. process 30x whole genomes at a throughput of up to 2.5 samples/hour and 50x whole exomes at up to 60 samples/hour

  • Increases lab productivity

    Faster time to insight: e.g. up to ~22K whole genomes/node/year

  • Multi-purpose Bioinformatics use

    Leverage GOAST’s high-core, fast I/O, and high-memory specs to run any Bioinformatics or HPC tools or scripts

  • Cost effective

    Costs up to 50% less than boutique solutions that rely on GPUs or FPGAs, with no additional licensing fees. A single GOAST server can replace up to 50 standard nodes

  • Scalable

    Deploy as a single-node appliance or as a cluster and grow linearly with flexibility

  • Easy to use

    Simplified wrapper scripts for omics at the command line

GOAST v4.0 Capabilities

Lenovo GOAST is a recipe that comes with several preconfigured and optimized workflows:

  • Germline variant calling (WES and WGS)
  • Germline joint calling (WES and WGS)
  • Somatic short variant discovery (WES and WGS)
    • Tumor + Normal pair
    • Tumor Only

All needed dependencies come pre-installed in a Conda environment which is easily replicated for additional users. Workflows are submitted using the command line GOAST Util Tool which enables users to submit complex GATK workflows with a single command. This tool automatically allocates resources to be used most efficiently based on the number of samples submitted and the available computational resources. Users may also monitor progress, abort, and restart jobs, and manage temporary files.
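This brief does not describe how the GOAST Util Tool decides on its resource allocation. As a purely hypothetical illustration of the idea, the Python sketch below splits a node's CPU threads evenly across samples and caps concurrency so that each job keeps a minimum number of threads. The names `Allocation` and `plan_allocation` and the default of 8 threads per job are assumptions made for this example, not part of GOAST.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    concurrent_jobs: int   # samples processed at the same time
    threads_per_job: int   # CPU threads given to each running sample


def plan_allocation(n_samples: int, total_threads: int,
                    min_threads_per_job: int = 8) -> Allocation:
    """Hypothetical policy: share threads evenly, never starving a job."""
    if n_samples < 1 or total_threads < 1 or min_threads_per_job < 1:
        raise ValueError("all arguments must be >= 1")
    # Largest number of jobs that can run while honoring the minimum.
    max_jobs = max(1, total_threads // min_threads_per_job)
    concurrent = min(n_samples, max_jobs)
    # Divide the node evenly among the jobs that actually run.
    threads = max(1, total_threads // concurrent)
    return Allocation(concurrent, threads)


if __name__ == "__main__":
    # 128 threads is an example value (2 CPUs x 64 cores, no SMT assumed).
    for samples in (1, 3, 40):
        print(samples, plan_allocation(samples, total_threads=128))
```

Under this policy a single sample gets the whole node, while a large batch runs as many samples side by side as the minimum thread count allows and queues the rest.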

GOAST v4.0 comes with several major updates, including additional workflows, Snakemake as the workflow manager, Conda for managing software installations, and a complete rewrite of the backend GOAST Util Tool.

Increased Lab Productivity

Accelerated execution speeds mean you get to process more samples, find answers faster, and generate breakthroughs that much sooner. GOAST outperforms competing CPU-based systems (and even FPGA- and GPU-based systems) because we tune our systems to meet the requirements of bioinformatics pipelines running in-node workloads rather than those assumed in traditional HPC workloads. The result is the ability to run software pipelines at higher throughput, which means batches of samples are analyzed in less time (Table 2).

Table 2. Lab Productivity for Omics expected on a single GOAST system*

                            30x WGS samples processed (n)        50x WES samples processed (n)
Expected Lab Productivity   GOAST Intel Base   GOAST AMD Base    GOAST Intel Base   GOAST AMD Base
Samples/day/node            42.4               60.0              1,011              1,440
Samples/year/node           15,500             21,900            369,000            525,600

* Performance is based on the processing of the NA12878 sample, which was sequenced on the NovaSeq 6000. All processing was performed on local NVMe drives. Performance may vary based on hardware setup and coverage of sample.
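The yearly figures in Table 2 are consistent with multiplying the daily figures by 365 and rounding, which assumes a node runs continuously all year. The Python sketch below reproduces that arithmetic and also converts a daily rate into average minutes per sample. The function names are illustrative, and the daily rates are copied from Table 2.

```python
DAYS_PER_YEAR = 365
MINUTES_PER_DAY = 24 * 60

# Samples per day per node, copied from Table 2.
SAMPLES_PER_DAY = {
    ("GOAST Intel Base", "30x WGS"): 42.4,
    ("GOAST AMD Base", "30x WGS"): 60.0,
    ("GOAST Intel Base", "50x WES"): 1011,
    ("GOAST AMD Base", "50x WES"): 1440,
}


def samples_per_year(per_day: float) -> float:
    """Yearly throughput, assuming uninterrupted operation (an idealization)."""
    return per_day * DAYS_PER_YEAR


def minutes_per_sample(per_day: float) -> float:
    """Average wall-clock minutes per sample at the given daily rate."""
    if per_day <= 0:
        raise ValueError("per_day must be positive")
    return MINUTES_PER_DAY / per_day


if __name__ == "__main__":
    for (system, assay), rate in SAMPLES_PER_DAY.items():
        print(f"{system:17s} {assay}: {samples_per_year(rate):>9,.0f} samples/year, "
              f"{minutes_per_sample(rate):5.1f} min/sample on average")
```

Real-world yearly totals will be lower once maintenance windows, queue gaps, and data transfer are taken into account.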

Lenovo has performed the heavy lifting of optimizing workflows as well as ensuring software updates work seamlessly together. This gives researchers the opportunity to focus on the research questions, instead of spending valuable time tuning hardware, tweaking software versions, and optimizing workflows.

Multi-purpose Bioinformatics use

GOAST is a high-performance system for multi-purpose Bioinformatics use. The system comes preloaded with Omics tools to get you up and running on day one or it can be fully customized with the Bioinformatics tools of your choice.

  • For multi-omics analytics: Lenovo pre-installs the tools and other dependencies in Table 3 necessary to run the Broad Institute’s GATK Best Practices for Germline and Somatic SNP and Indel discovery. Lenovo GOAST also provides pre-configured scripts to allow you to run (submit, monitor, manage) samples on the Germline workflow and Somatic workflow optimally on Lenovo hardware with the GOAST Util Tool.
  • For other Bioinformatics: Install any tools of your choice on GOAST systems or talk to our team about pre-installing your software pipeline of choice. GOAST nodes are configured to support a wide range of bioinformatics workflows.

Table 3. Genomics analytics software and other dependencies pre-installed by GOAST 4.0

Software            Version
GATK                4.4.0.0
BWA                 0.7.17
BWA-MEM2            2.2.1
Samtools            1.17
Picard Tools        3.0.0
OpenJDK             17.0.3
Snakemake*          7.32.3
SLURM (Optional)*   23.02.4
OS (Recommended)    Rocky Linux 9.2

* GOAST systems currently use Snakemake as the workflow manager and job scheduler on a single node setup, and Slurm as the job scheduling system on a multi-node setup.
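Sites that customize a GOAST system may want to confirm that the installed tools still match the versions in Table 3. The Python sketch below is one hypothetical way to do that comparison. The pinned versions are copied from Table 3, while `find_mismatches` and the shape of the `installed` mapping are assumptions for illustration. How the installed versions are collected (for example from the Conda environment) is deliberately left out.

```python
from typing import Dict, Optional, Tuple

# Versions copied from Table 3 (Slurm is optional, so it is not pinned here).
PINNED: Dict[str, str] = {
    "GATK": "4.4.0.0",
    "BWA": "0.7.17",
    "BWA-MEM2": "2.2.1",
    "Samtools": "1.17",
    "Picard Tools": "3.0.0",
    "OpenJDK": "17.0.3",
    "Snakemake": "7.32.3",
}


def find_mismatches(
    installed: Dict[str, str],
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each differing tool to (expected, found); found is None if missing."""
    mismatches: Dict[str, Tuple[str, Optional[str]]] = {}
    for tool, expected in PINNED.items():
        found = installed.get(tool)
        if found != expected:
            mismatches[tool] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # Hypothetical inventory in which Samtools was upgraded by hand.
    inventory = {**PINNED, "Samtools": "1.19"}
    for tool, (expected, found) in find_mismatches(inventory).items():
        print(f"{tool}: expected {expected}, found {found}")
```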

Cost Effective

GOAST leverages an optimized CPU-based architecture, so it requires no FPGAs or GPUs of any kind for acceleration. Users should expect GPU-like performance for the optimized workflows at CPU-level prices: up to 50% lower than boutique solutions relying on FPGAs or GPUs, with no licensing fees.

The Lenovo Bioinformatics R&D group continually tests new bioinformatics pipelines and releases to its customers hardware-tuned versions of standardized workflows such as the Broad Institute’s GATK Best Practices at no cost.

In addition, GOAST solutions can reduce investments needed to support large-scale projects since a single GOAST Plus server can replace up to 40 standard nodes, reducing hardware, maintenance costs, and other expenses, including power consumption and cooling.

Scalable

The performance of Lenovo GOAST scales linearly from single-node appliance to cluster implementation to serve the needs of labs of all sizes, from small research groups, to commercial labs, and to national population-level projects. This includes transitioning from WES to WGS, undertaking a new project with greater scope and complexity, and expanding both data and users. Scale linearly simply by adding compute and storage building blocks as needed.

Which GOAST Configuration is right for me?

The optimal GOAST configuration depends on your lab’s throughput needs. Both GOAST Intel Base (42.4 WGS/day/node) and GOAST AMD Base (60.0 WGS/day/node) can support the output of an Illumina NovaSeq 6000 at full capacity, which produces 26 samples per day. The purpose of Lenovo GOAST is to help enable research through increased efficiency and usability. Lenovo will work with you to find the best solution to support your bioinformatics analytics needs.

For more information see NovaSeq 6000 System Specifications:
https://www.illumina.com/systems/sequencing-platforms/novaseq/specifications.html
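The sizing comparison above can be turned into a quick estimate. The Python sketch below computes how many nodes are needed to keep pace with one or more sequencers, using the per-node WGS rates and the 26 samples/day figure from this brief. It assumes throughput scales linearly with node count, as described in the Scalable section. The function name and the four-sequencer scenario are hypothetical.

```python
import math


def nodes_needed(sequencer_samples_per_day: float,
                 node_samples_per_day: float,
                 n_sequencers: int = 1) -> int:
    """Smallest node count whose combined daily rate covers the demand."""
    if sequencer_samples_per_day <= 0 or node_samples_per_day <= 0:
        raise ValueError("rates must be positive")
    if n_sequencers < 1:
        raise ValueError("n_sequencers must be >= 1")
    demand = sequencer_samples_per_day * n_sequencers
    # Linear scaling assumed: total capacity = nodes * per-node rate.
    return math.ceil(demand / node_samples_per_day)


if __name__ == "__main__":
    NOVASEQ_PER_DAY = 26  # samples/day at full capacity, as quoted above
    for name, rate in (("GOAST Intel Base", 42.4), ("GOAST AMD Base", 60.0)):
        one = nodes_needed(NOVASEQ_PER_DAY, rate)
        four = nodes_needed(NOVASEQ_PER_DAY, rate, n_sequencers=4)
        print(f"{name}: {one} node for 1 sequencer, {four} nodes for 4 sequencers")
```

This is a back-of-the-envelope estimate. It leaves no headroom for reanalysis, other workloads, or downtime.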

Learn More

For more information, see the following resources:

For questions, reach out to Dana Alegre, M.S., Solutions Architect, Life Sciences, Lenovo HPC & AI.

Author

Dana Alegre is Lenovo’s Solution Architect in the HPC Life Sciences vertical, leading the Lenovo Genomics Optimization and Scalability Tool (GOAST) team. The GOAST team works to enable research by optimizing genomics workflows on Lenovo hardware. She has been asking and answering biological questions by analyzing next-generation sequencing data, first in the Genomics Core at the Stowers Institute for Medical Research and then at the Center for Quantitative Life Sciences at Oregon State University. Her collaborations with dozens of groups conducting genomics and bioinformatics research over the years have resulted in publications in Science and Nucleic Acids Research. Her professional accomplishments include developing a bioinformatics pipeline to support the Oregon Health Authority’s efforts to identify and monitor variants in wastewater to help combat the COVID-19 pandemic.

Trademarks

Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. A current list of Lenovo trademarks is available on the Web at https://www.lenovo.com/us/en/legal/copytrade/.

The following terms are trademarks of Lenovo in the United States, other countries, or both:
Lenovo®

The following terms are trademarks of other companies:

Intel® is a trademark of Intel Corporation or its subsidiaries.

Linux® is the trademark of Linus Torvalds in the U.S. and other countries.

Other company, product, or service names may be trademarks or service marks of others.