Containers, Automation, and Nextflow Accelerate Genomic Data Processing

August 22, 2019

From climate-resistant crops to the ability to predict disease risk in healthy individuals, genomics, the study of an organism’s complete set of genetic material, holds great promise for the world. It’s a data-intensive field where reproducibility, efficient data management, and fast processing of large datasets are essential.

One thing that slows down that processing is manual work. That was the case for a recent ClearScale customer. Without the right IT architecture in place, one of the company’s key project teams had to process data samples manually, which delayed the overall workflow. The time-consuming process entailed going to the data source, downloading the samples, gathering the input parameters, uploading the data to Amazon Elastic Compute Cloud (Amazon EC2) instances, and running a variety of data pipeline steps.

Knowing that ClearScale had specific expertise in AWS services, Big Data applications, and automation, the customer requested assistance in automating some of its processes, including the processing of data samples.

Containers, Pipelines, and Workloads

The first step was for the ClearScale team to gather and review the company’s current workflow and business requirements. The team determined that creating and automating data pipelines was the optimal solution. A pipeline processes data in a series of steps, each run by a different tool, with the output of one step passed on as the input to the next.
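To make the pipeline idea concrete, here is a minimal Python sketch of steps chained so that each one consumes the previous step’s output. The step names, file extensions, and sample IDs are hypothetical placeholders rather than the customer’s actual tools, and each function only renames files to stand in for real work.

```python
from functools import reduce
from typing import Callable, Iterable, List

# A step takes the previous step's output and returns its own output.
Step = Callable[[List[str]], List[str]]


def download_samples(sample_ids: List[str]) -> List[str]:
    # Hypothetical stand-in for fetching raw sample files from the data source.
    return [f"{sample_id}.fastq" for sample_id in sample_ids]


def align_reads(fastq_files: List[str]) -> List[str]:
    # Hypothetical stand-in for an alignment tool.
    return [name.replace(".fastq", ".bam") for name in fastq_files]


def summarize(bam_files: List[str]) -> List[str]:
    # Hypothetical stand-in for a reporting tool.
    return [name.replace(".bam", ".summary.csv") for name in bam_files]


def run_pipeline(steps: Iterable[Step], initial_input: List[str]) -> List[str]:
    # Feed each step's output into the next step, in order.
    return reduce(lambda data, step: step(data), steps, initial_input)


PIPELINE = [download_samples, align_reads, summarize]
result = run_pipeline(PIPELINE, ["sampleA", "sampleB"])
print(result)  # ['sampleA.summary.csv', 'sampleB.summary.csv']
```

In a containerized setup, each of these functions would correspond to a tool packaged in its own container image, with the workflow engine handling the hand-off of files between steps.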

Various AWS services were evaluated for use in architecting the solution. The team chose Amazon Elastic Container Service (Amazon ECS), a highly scalable, high-performance container orchestration service. The idea was to create containers for every step of data sample processing without the customer having to install and operate its own container orchestration software. Nor would the customer have to manage and scale a cluster of virtual machines (VMs) or schedule containers on those VMs. That would significantly reduce capital investment.

Nextflow, a free, flexible open-source workflow engine, was selected to enable scalable, reproducible scientific workflows using the containers. Nextflow includes built-in support for AWS Batch, a managed computing service that runs containerized workloads on Amazon ECS. AWS Batch allows Nextflow pipelines to be deployed in the cloud by offloading process executions as managed batch jobs. The service spins up the required compute instances on demand, scaling the number and composition of the instances up and down to match the workload’s actual resource needs at any point in time. That flexibility could yield cost savings as well.
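For illustration only, the sketch below shows roughly what a request for one containerized step could look like if it were submitted to AWS Batch directly through the AWS SDK for Python. Nextflow builds and submits its own jobs when its AWS Batch executor is configured, so this is not how the customer’s solution was written. The queue name, job definition, script name, and resource sizes are all assumptions. The code only assembles the request and does not call AWS.

```python
from typing import Any, Dict, List


def build_batch_job_request(
    step_name: str,
    sample_id: str,
    command: List[str],
    vcpus: int,
    memory_mib: int,
    job_queue: str = "genomics-queue",        # hypothetical queue name
    job_definition: str = "pipeline-step:1",  # hypothetical job definition
) -> Dict[str, Any]:
    """Build keyword arguments for an AWS Batch SubmitJob call for one step."""
    return {
        "jobName": f"{step_name}-{sample_id}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "command": command,
            # AWS Batch expects resource values as strings.
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
        },
    }


request = build_batch_job_request(
    step_name="align",
    sample_id="sampleA",
    command=["align_reads.sh", "sampleA.fastq"],  # hypothetical script
    vcpus=4,
    memory_mib=16384,
)
print(request["jobName"])  # align-sampleA

# With boto3 installed and AWS credentials configured, the request could be
# submitted with: boto3.client("batch").submit_job(**request)
```

Declaring CPU and memory per job is what lets the service choose how many instances, and of what size, to start for the queued work.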

The Custom Architecture

The ClearScale team then developed the architecture based on an AWS solution for running workflows with EC2 instances pre-configured for Nextflow. The team customized the solution by incorporating Cell Ranger, a set of analysis pipelines; Perl, a programming language; ingestion containers for different Nextflow processes; AWS CodeBuild jobs; custom job definitions; and other components.

With the customized solution tested, documented, and deployed, the customer is now able to process data faster. That can lead to accelerated analyses and, ultimately, faster time to market for the company’s products. The solution also lets the company scale resources up efficiently to meet demand and scale them back down when demand drops, which yields further cost savings.

The Rest of the Story

Increasingly, companies involved in genomics and fields such as biology, drug discovery, and molecular diagnostics are reaching out to ClearScale for assistance in developing custom architecture and infrastructure to optimize and accelerate their data-handling processes and workflows. While we don’t bill ourselves as genomics experts, we do have extensive experience in using the vast array of AWS services created to support data pipeline development and deployment, as well as automation, cloud migration and much more. That experience is invaluable to genomics companies as well as to those in other fields.

Get in touch today to speak with a Cloud expert and discuss how we can help:

Call us at 1-800-591-0442
Send us an email: sales@clearscale.net

