Processing bulk RNA-seq data is central to scientific discovery, as it enables researchers to interpret gene expression effectively.
Large volumes of bulk RNA-seq data can be processed in computational pipelines, but the required compute and tooling come at a hefty cost that can drain a budget in no time.
At Elucidata, we consistently empower researchers to focus on discovery by enabling access to deeply curated, FAIR data, which, in turn, results in improved patient outcomes and better drug discovery. This blog explores the challenges of processing large-scale bulk RNA-seq data and offers solutions for mitigating them. Read on to learn more!
Currently, executing bulk RNA-seq pipelines using the industry’s best methods costs anywhere between $10 and $15 per sample. When implemented at large scale, with thousands of samples, this expense quickly becomes unsustainable and drains financial resources.
As researchers now need thousands of samples, finding the most cost-effective execution method is essential. Running ARCHS4 pipelines on AWS, as described in the research paper “Massive mining of publicly available RNA-seq data from human and mouse,” has been the most economical option so far. The pipeline is quite cost-effective for large batches of data, but it has its own challenges, such as missing samples, poor alignment, and the absence of quality reports. Even then, processing thousands of samples can quickly drive costs up to $50,000. Executing bulk RNA-seq pipelines cost-effectively is therefore necessary for acquiring high-quality data.
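To see how quickly per-sample costs compound, here is a minimal back-of-the-envelope sketch in Python. The per-sample rates are the industry figures quoted above; the batch size is a hypothetical example:

```python
# Back-of-the-envelope cost model for large-scale processing.
# Per-sample rates are the industry estimates quoted above;
# the cohort size is a hypothetical example.
COST_PER_SAMPLE_USD = (10, 15)   # industry-standard pipelines
N_SAMPLES = 5_000                # a typical large cohort

low, high = (rate * N_SAMPLES for rate in COST_PER_SAMPLE_USD)
print(f"{N_SAMPLES} samples: ${low:,} - ${high:,}")
# -> 5000 samples: $50,000 - $75,000
```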
In line with our vision of quality assurance and cost-effectiveness, Elucidata has implemented various strategies that significantly reduce costs, achieving more than 75% savings compared to industry standards without compromising the quality of the output.
The major mechanisms we implemented initially were AWS Batch and Spot Instances for bulk RNA-seq data processing. Without cost-effective large-scale processing, the exponentially growing monthly costs would have become a major financial burden.
At Elucidata, we used AWS Batch and Spot Instances to consistently process bulk RNA-seq data, a budget-friendly solution at first. The approach worked well with smaller datasets. However, as the workload expanded, our monthly cloud expenses resembled an accelerating train, escalating by 40-50% each month! We realized this trend would not be viable, especially if we wanted to run more computationally intensive RNA-seq pipelines. Thus, we sought a more sustainable way to execute these pipelines without compromising our financial stability.
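For context, this is roughly how a per-sample job is submitted to AWS Batch with boto3; a minimal sketch, assuming a job queue backed by a Spot compute environment (the queue, job-definition, and sample names here are hypothetical, not our production setup):

```python
import boto3

batch = boto3.client("batch")

# The queue is assumed to be backed by a Spot compute environment
# ("type": "SPOT"); all names below are illustrative.
response = batch.submit_job(
    jobName="bulk-rnaseq-SRR0000001",
    jobQueue="rnaseq-spot-queue",
    jobDefinition="bulk-rnaseq-pipeline",
    containerOverrides={
        "command": ["run_pipeline.sh", "--sample", "SRR0000001"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "8"},
            {"type": "MEMORY", "value": "32768"},  # MiB
        ],
    },
)
print("Submitted job:", response["jobId"])
```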
* The dip in costs in November is due to reduced customer delivery requests.
In an attempt to control the rising costs, we started exploring alternative processing options, including other cloud vendors, on-premises infrastructure, and data-center-as-a-service (DCaaS) providers. While evaluating these, we identified the challenges described below:
AWS S3's seemingly attractive storage rates mask a vendor lock-in strategy: exorbitant egress fees hinder the ability to seamlessly migrate data to non-AWS compute providers, so transferring terabytes of intermediary data out of AWS S3 was out of the question. This led us to search for a storage solution that prioritized affordability and flexibility, aware that many existing storage solutions pose challenges of their own.
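A quick, hedged estimate shows the scale of the problem. Assuming internet egress at roughly $0.09/GB (the first AWS pricing tier; actual tiered rates vary by region and volume), moving a single large run's intermediary data out of S3 is prohibitively expensive:

```python
# Rough egress estimate; $0.09/GB is an approximation of the first
# internet-egress pricing tier -- tiered rates vary by region/volume.
EGRESS_USD_PER_GB = 0.09
INTERMEDIATE_TB = 75  # e.g., the ~75 TB dataset described later

cost_usd = INTERMEDIATE_TB * 1024 * EGRESS_USD_PER_GB
print(f"~${cost_usd:,.0f} to move {INTERMEDIATE_TB} TB out of S3 once")
# -> ~$6,912 to move 75 TB out of S3 once
```

And that is for a single transfer; repeated movement of intermediary data across providers multiplies this cost on every run.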
Further, affordable alternatives to AWS S3 are scarce, and our concerns about these alternatives include:
These factors, therefore, necessitated our transition from object storage to a Network File System (NFS) for intermediary data. Upon analyzing our situation, we concluded that AWS S3 is cost-effective when relied upon completely, but it becomes a liability in a hybrid cloud setup.
The path forward involved a strategic shift away from AWS as our primary compute provider while retaining it as a backup for periods of peak demand.
Moving away from AWS for compute meant abandoning AWS Batch, its proprietary workload-management technology. This shift presented numerous challenges, since our systems had been deeply integrated with AWS services and required substantial adjustments.
Since we needed to replace AWS Batch with a different workload-management technology, we evaluated multiple workload managers, including:
Eventually, we opted for Kubernetes (k8s), which emerged as the frontrunner due to its extensive documentation, strong community support, scalability, open-source nature, and the existing expertise within our team.
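As an illustration of the model we moved to, here is a minimal sketch of submitting one per-sample pipeline task as a Kubernetes Job using the official Python client. The image, namespace, and resource figures are hypothetical, not our production manifests:

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside the cluster

# One pipeline task for one sample, expressed as a Kubernetes Job.
container = client.V1Container(
    name="rnaseq-align",
    image="registry.example.com/rnaseq-pipeline:latest",  # hypothetical image
    command=["run_pipeline.sh", "--sample", "SRR0000001"],
    resources=client.V1ResourceRequirements(
        requests={"cpu": "8", "memory": "32Gi"},
    ),
)

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="align-srr0000001"),
    spec=client.V1JobSpec(
        backoff_limit=2,  # retry a failed sample up to twice
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="pipelines", body=job)
```

One Job per sample keeps failures isolated: a bad sample is retried or skipped without affecting the rest of the batch.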
We also contemplated supplementing our existing ECS cluster with compute machines from third-party providers, so that AWS EC2 instances and externally managed machines could operate in tandem. However, virtually networking machines from different vendors proved to be a cumbersome and cost-ineffective endeavor, and this approach would have introduced significant complexity into the entire system.
After months of diligent work, we eventually transitioned to Kubernetes (k8s) on new compute infrastructure, which yielded the significant benefits listed below:
Nonetheless, this approach had its share of minor issues, such as:
For storage, we transitioned to a Network File System (NFS) mounted directly on the compute machines. This provided faster I/O operations for intermediary data than AWS EBS or AWS EFS, further streamlining the processing pipeline. AWS EFS did offer the option to speed up data transfer by provisioning higher IOPS and throughput, but doing so was expensive. Finally, the processed data was transferred back to S3 for ingestion into Polly, our data & AI cloud platform.
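The resulting data flow is simple: pipelines read and write intermediary files on the NFS mount, and only final outputs are pushed back to S3. A minimal sketch with boto3 (the mount path, bucket name, and file pattern are illustrative):

```python
import pathlib

import boto3

s3 = boto3.client("s3")

# Intermediary files stay on the NFS mount during processing; only final
# outputs go back to S3 for ingestion into Polly. Names are illustrative.
results_dir = pathlib.Path("/mnt/nfs/results/SRR0000001")
bucket = "polly-ingestion-bucket"

for path in results_dir.glob("*.tsv"):
    s3.upload_file(str(path), bucket, f"processed/SRR0000001/{path.name}")
    print(f"Uploaded {path.name}")
```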
The impact of the migration to bare-metal instances and NFS can be seen in:
We have processed thousands of datasets across multiple variations of bulk RNA-seq pipelines, totaling millions of samples over time. We have scaled our compute capacity to thousands of CPUs and terabytes of RAM, and our NFS storage to hundreds of terabytes at a time, all without breaking the bank. Overall, we have witnessed remarkable scalability, availability, and resilience in our system. One dataset we would like to highlight comprised around 700 samples and generated approximately 75 TB of intermediate data, which we scaled and executed successfully.
This journey highlights the importance of exploring alternatives, embracing open-source solutions, and prioritizing efficient resource utilization. It stands as a testament to the power of innovation and collaboration in advancing the frontiers of scientific discovery.
Contact Us or reach out to us at info@elucidata.io to learn more.