Data Science & Machine Learning

Cost-Effective Strategies for Large Scale RNA-Seq Pipeline Computing

Processing bulk RNA-Seq data is crucial for scientific discovery, as it enables effective interpretation of gene expression.


Large volumes of bulk RNA-seq data can be processed in computational pipelines, but the compute and tooling these pipelines demand come at a hefty cost that can sink a budget in no time.

At Elucidata, we consistently empower researchers to focus on discovery by enabling access to deeply curated, FAIR data, which, in turn, leads to improved patient outcomes and better drug discovery. This blog explores the problem of processing large-scale bulk RNA-seq data and offers solutions for mitigating these challenges. Read on to understand more!

Problem Statement: Scaling Ops Without Breaking the Bank

Currently, executing bulk RNA-seq pipelines using the industry’s best methods costs anywhere between $10 and $15 per sample. When implemented at a large scale, with thousands of samples, this expense quickly becomes unsustainable and drains financial resources.

As researchers now need thousands of samples, finding the most cost-effective execution method is essential. Running the ARCHS4 pipeline on AWS, as described in the research paper “Massive mining of publicly available RNA-seq data from human and mouse,” has been the most economical option so far. The pipeline is quite cost-effective for large batches of data but has its own challenges, such as missing certain samples, poor alignment, and the lack of a quality report. In fact, processing thousands of samples can quickly drive the cost up to $50,000. Thus, executing bulk RNA-Seq pipelines cost-effectively is necessary for acquiring high-quality data.
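The scale of the problem follows from simple arithmetic. The sketch below uses the per-sample figures quoted above; the batch size of 5,000 samples is illustrative:

```python
def pipeline_cost(n_samples: int, cost_per_sample: float) -> float:
    """Total processing cost for a batch, ignoring storage and egress."""
    return n_samples * cost_per_sample

# At the industry-standard $10-$15 per sample, a batch of a few
# thousand samples lands in the $50,000-$75,000 range.
low = pipeline_cost(5000, 10.0)   # 50000.0
high = pipeline_cost(5000, 15.0)  # 75000.0
```

Even before storage and data-transfer fees, per-sample compute alone dominates the budget at this scale.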

In line with our vision of quality assurance and cost-effectiveness, Elucidata has implemented various strategies to significantly reduce costs, achieving more than 75% savings compared to industry standards without compromising the quality of the output.

The major mechanisms implemented include using AWS Batch and Spot instances for bulk RNA-Seq data processing. Without this cost-effective large-scale processing, the exponentially growing monthly cost would have become a major financial burden.

AWS Batch and Spot Instances for Bulk RNA-Seq Data Processing

At Elucidata, we used AWS Batch and Spot instances to process bulk RNA-Seq data, deeming it a budget-friendly solution. The approach proved helpful initially, with smaller datasets. However, as the workload expanded, our monthly cloud expenses resembled an accelerating train, escalating by 40-50% each month! We realized that this trend would not be viable, especially if we wanted to run more computationally intensive RNA-Seq pipelines. Thus, we sought a more sustainable approach to executing these pipelines without compromising our financial stability.
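For readers unfamiliar with the setup, a Spot-backed AWS Batch compute environment is typically requested through the Batch API (here via boto3). This is a minimal sketch, not our production configuration; the environment name, vCPU ceiling, and the subnet/security-group placeholders are all illustrative:

```python
def spot_compute_environment(name: str, max_vcpus: int) -> dict:
    """Build the request body for a managed, Spot-backed Batch environment."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "computeResources": {
            "type": "SPOT",  # run on spare EC2 capacity at a steep discount
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,   # scale to zero when no samples are queued
            "maxvCpus": max_vcpus,
            "instanceTypes": ["optimal"],
            "subnets": ["subnet-XXXX"],       # placeholder, not a real subnet
            "securityGroupIds": ["sg-XXXX"],  # placeholder
        },
    }

params = spot_compute_environment("rnaseq-spot", 2048)
# The real call would be:
# boto3.client("batch").create_compute_environment(**params, serviceRole=...)
```

The `minvCpus: 0` setting is what makes the approach budget-friendly for bursty workloads: the cluster costs nothing between batches.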

AWS Computation & Storage Cost Increase Chart

* The dip in costs in November is due to reduced customer delivery requests.

Challenges with Large-scale Pipeline Computation

In an attempt to control the increasing costs, we started exploring alternative processing options, such as other cloud vendors, on-premises infrastructure, and data-center-as-a-service (DCaaS) providers. While evaluating these, we identified the challenges below:

  1. Cost-intensive Services: The foundational expenses for computational resources such as AWS EC2 instances, Azure Virtual Machines, or GCP Bare Metal solutions, as offered by the leading providers AWS, Azure, and GCP, can be exorbitant. While these providers offer more economical options, like AWS EC2 Spot Instances, those come with certain caveats. Additionally, storage services such as S3, EBS, EFS, and FSx can also become costly at a large scale.
  2. Data Security Concerns: Some of the alternate options include non-US-based providers, which poses a challenge for us, as the majority of our customers operate within the US. Our security team has expressed reservations regarding data security, highlighting the need for providers with robust measures that align with US standards.
  3. Inflexible Pricing Models: Since requirements fluctuate often, we require a pay-per-use model to optimize costs. However, most cloud providers require a large upfront commitment for resources, which is not feasible for us.
  4. Limited Compute Capacity: Certain cloud providers lack the computational capacity required for extensive bulk RNA-seq analysis, thereby restricting our ability to engage with them. This limitation underscores the importance of partnering with providers equipped to handle the demanding computational requirements inherent in large-scale RNA-Seq analysis.
  5. Bandwidth Limitations: Several cloud vendors provide insufficient bandwidth within the cloud environment, or have limitations on data transfer out of the cloud (egress), which can significantly slow down our usage. Since our pipelines deal with large datasets, this has been one of the major problems for us.
  6. Huge Egress Costs: Egress refers to data transferred out of the cloud, and with certain providers, these costs can be notably steep. This poses a challenge, particularly considering that our pipelines require transferring terabytes of data over the internet to our storage solution, AWS S3.
  7. On-premises Server Management Challenges: We have weighed the possibility of managing our own servers within a data-center-as-a-service (DCaaS) facility. Yet the prospect of building and maintaining physical infrastructure (space, power, and specialized skills) is daunting. The initial investment and added complexity of this route persuaded us to set it aside for the time being, though we haven't entirely dismissed it for potential exploration.
  8. The Storage Problem: Previously, we relied on AWS S3 to store intermediate data. These intermediates include files such as .bam and .bat files generated before the final, clean data is produced. As the intermediate data amounts to terabytes, traditional machines with internal storage proved impractical: they lack flexibility and cannot accommodate our expanding demands. We favor AWS S3 for its dynamic scalability and cost-effectiveness in storing large volumes of data. However, the significant egress costs associated with AWS S3 pose a challenge, leading to a vendor lock-in scenario.
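The egress concern in points 6 and 8 is easy to quantify. The rate below is an illustrative figure in the neighborhood of AWS's published internet egress pricing; actual pricing is tiered and varies by region and volume:

```python
def egress_cost_usd(terabytes: float, rate_per_gb: float = 0.09) -> float:
    """Approximate cost of moving data out of S3 over the internet.

    The default rate is illustrative only; real AWS data-transfer
    pricing is tiered and region-specific.
    """
    return terabytes * 1024 * rate_per_gb

# A single 75 TB batch of intermediate data, moved out of S3 once:
print(round(egress_cost_usd(75)))  # ~6912 at this illustrative rate
```

In other words, a single hybrid-cloud round trip of intermediate data can cost thousands of dollars before any compute is purchased, which is why the egress fees dominate the storage decision.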

Breaking Free from AWS Vendor Lock-In: Quest for Alternatives

AWS S3's seemingly attractive storage rates mask a vendor lock-in strategy: exorbitant egress fees make transferring terabytes of intermediary data from AWS S3 to non-AWS compute providers all but impossible. This led us to search for a storage solution that prioritized affordability and flexibility, with the awareness that many existing storage solutions pose challenges of their own.

AWS S3 Usage for Intermediary Data

Further, affordable alternatives to AWS S3 are scarce, and our main concerns about them include:

  1. Limitations in data size handling
  2. S3 API incompatibilities
  3. Restrictive bandwidth limitations
  4. High egress and API costs
  5. API rate limiting

These factors, therefore, necessitated our transition from object storage to a Network File System (NFS) for intermediary data. Upon analyzing our situation, we concluded that AWS S3 is cost-effective when relied upon completely, but it becomes a drawback in a hybrid cloud setup.

Mitigating Challenges: Embracing Open Source and Efficiency

The path forward involved strategically shifting away from AWS as our primary computing provider while retaining it as a backup for periods of peak demand.

Moving away from AWS for computing entailed abandoning AWS Batch, its proprietary workload management technology. This shift presented numerous challenges since our systems had been deeply integrated with AWS services, necessitating substantial adjustments.

As it became necessary to replace AWS Batch with a different workload management technology, we evaluated multiple workload managers including:

  1. Slurm Workload Manager
  2. ECS
  3. Seqera Platform
  4. IBM Spectrum LSF
  5. Kubernetes (k8s) 

Eventually, we opted for Kubernetes (k8s), which emerged as the frontrunner due to its extensive documentation, strong community support, scalability, open-source nature, and our team's existing expertise.

We also contemplated supplementing our existing ECS cluster with compute machines from third-party providers, enabling AWS EC2 instances and externally managed machines to operate in tandem. However, virtually networking machines from different vendors proved cumbersome and cost-ineffective, and implementing this approach would have introduced significant complexity into the entire system.

After months of diligent work, we eventually transitioned to Kubernetes (k8s) on a new compute infrastructure, which yielded the significant benefits below:

  1. Kubernetes' flexible workload management allowed us to tailor it to our specific needs. This resulted in a dramatic improvement in resource utilization compared to AWS Batch and significantly reduced compute waste.
  2. Kubernetes, being open source, granted us greater autonomy in managing workloads. Should we opt to transition to alternative compute providers, the process wouldn't pose the same challenges we encountered within the AWS ecosystem.
  3. Additionally, we chose bare-metal machines, which accelerated execution speeds compared to AWS non-bare-metal EC2 machines.
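
To make benefit 1 concrete: on Kubernetes, each pipeline step runs as a Job whose resource requests let the scheduler bin-pack work tightly onto nodes. The sketch below builds such a manifest as a plain dict; the Job name, container image, resource sizes, and NFS server address are all placeholders, not our production values:

```python
def rnaseq_job(name: str, image: str, cpus: int, mem_gi: int) -> dict:
    """Build a minimal Kubernetes Job manifest for one pipeline step."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # retry a failed step up to twice
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "aligner",
                        "image": image,
                        # Explicit, equal requests/limits let the scheduler
                        # pack nodes tightly -- the utilization win over
                        # AWS Batch described above.
                        "resources": {
                            "requests": {"cpu": str(cpus), "memory": f"{mem_gi}Gi"},
                            "limits": {"cpu": str(cpus), "memory": f"{mem_gi}Gi"},
                        },
                        "volumeMounts": [{"name": "scratch",
                                          "mountPath": "/scratch"}],
                    }],
                    # Intermediate data lands on the NFS mount, not S3.
                    "volumes": [{
                        "name": "scratch",
                        "nfs": {"server": "10.0.0.10",       # placeholder
                                "path": "/exports/rnaseq"},  # placeholder
                    }],
                }
            },
        },
    }

manifest = rnaseq_job("align-sample-0001",
                      "example.org/rnaseq-aligner:latest", 16, 64)
# Serialized to YAML, this could be applied with `kubectl apply -f`.
```

Because the manifest is plain data, templating one Job per sample across thousands of samples is straightforward, and nothing in it is AWS-specific.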

This approach nonetheless had its share of minor issues:

  1. Substantial development effort
  2. Maintenance overhead due to managing Kubernetes and its associated technologies
  3. Escalated complexity resulting from Kubernetes integration

For storage, we transitioned to a Network File System (NFS) mounted directly on the compute machines. This approach provided faster I/O operations than AWS EBS or AWS EFS for intermediary data, further streamlining the processing pipeline. Although AWS EFS offered the option to increase data-transfer speed by provisioning higher IOPS and throughput, doing so was expensive. Finally, the processed data was transferred back to S3 for ingestion into Polly, our data & AI cloud platform.
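The last hop back to S3 only needs to carry the final outputs, not the terabytes of intermediates, which is what keeps egress costs in check. A minimal sketch of that hand-off, with illustrative paths, suffixes, and bucket names (the real upload call via boto3 is shown commented out):

```python
from pathlib import Path

# Illustrative: final outputs kept; intermediates (.bam etc.) stay on NFS.
FINAL_SUFFIXES = {".tsv", ".json", ".html"}

def upload_plan(outdir: str, bucket: str, prefix: str):
    """List (local_path, s3_key) pairs for final outputs only."""
    plan = []
    for p in sorted(Path(outdir).rglob("*")):
        if p.is_file() and p.suffix in FINAL_SUFFIXES:
            key = f"{prefix}/{p.relative_to(outdir)}"
            plan.append((p, key))
    return plan

# for local, key in upload_plan("/nfs/run-42/out", "my-bucket", "runs/42"):
#     boto3.client("s3").upload_file(str(local), "my-bucket", key)
```

Filtering at this boundary means only megabytes-to-gigabytes of results cross the internet per run, while the bulky intermediates live and die on the NFS mount.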

The Triumph: Unlocking Scalable and Affordable Bulk RNA-Seq Analysis

The impact of the migration, owing to the use of bare-metal instances and NFS, can be seen in:

  • Up to 85% reduction in cost, and
  • 2.5x faster overall pipeline execution.

AWS Computation & Storage Cost Decrease as the pipelines got migrated

We have processed thousands of datasets across multiple variations of bulk RNA-seq pipelines, totaling millions of samples over time. We have scaled our compute capacity to thousands of CPUs and terabytes of RAM, and our NFS storage to hundreds of terabytes at a time, all without breaking the bank. Overall, we've witnessed remarkable scalability, availability, and resilience in our system. One dataset in particular, comprising around 700 samples, generated approximately 75 TB of intermediate data, which we successfully scaled for and executed.

This journey highlights the importance of exploring alternatives, embracing open-source solutions, and prioritizing efficient resource utilization. It stands as a testament to the power of innovation and collaboration in advancing the frontiers of scientific discovery.

Contact Us or reach out to us at info@elucidata.io to learn more.
