Building Better Data Pipelines with AWS Step Functions

October 22 2019

When it comes to data management, there are three important components. There’s generating the data, which is often called online transaction processing (OLTP). There’s analyzing the data, which is referred to as online analytical processing (OLAP). Both can involve multiple systems.

Then there’s the process of moving data between systems. This can include copying data, moving it from on-premises systems to the cloud, reformatting it, combining it with other data sources, and other steps. Each step can require separate software. That’s where the data pipeline comes in.

A data pipeline enables a smooth, automated flow of data from one point to the next. It defines what, where, and how data is collected. It then automates the processes involved in extracting, transforming, combining, validating, and loading data for further analysis and visualization. It provides end-to-end velocity by eliminating errors and combatting bottlenecks or latency.

The Customer Need

For one ClearScale client, however, the analytics pipeline and its MySQL database had become the bottleneck in operating the company’s AWS-based technology platform. Anticipating a significant workload increase that would intensify the bottlenecks, the company reached out to ClearScale to help remedy the situation.

Specifically, the company needed a new AWS-centric data pipeline solution that would mitigate the bottlenecks while offering greater scalability to handle increased workloads. The solution also needed to be more cost-efficient and not require as many staff resources for management and administration.

There are a variety of ways to manage data. But for many AWS data management projects, AWS Data Pipeline is seen as the go-to service for processing and moving data between AWS compute and storage services and on-premises data sources. It’s known for helping to create complex data processing workloads that are fault-tolerant, repeatable, and highly available.

However, it’s not the only option, nor is it the best option for every situation.

A Creative Approach to Data Management

ClearScale determined that AWS Data Pipeline lacked the flexibility to meet this particular customer’s needs. After evaluating various options, the ClearScale team decided to take a unique approach by using AWS Step Functions, a general-purpose workflow management tool, for data pipeline orchestration.

Step Functions was developed for orchestrating complex flows using Lambda functions and is primarily used for app development. But ClearScale determined how to apply its capabilities to data pipeline orchestration and combine it with other AWS services to create a solution that could best meet the customer’s needs.

AWS Step Functions works by coordinating multiple AWS services into serverless workflows. There are no servers to provision, scale, or manage, so none of the associated costs or staff time are required.

The workflows consist of a series of steps, with the output of one acting as the input for the next, and are rendered as easy-to-understand state machine diagrams. Step Functions automatically triggers and tracks each step and retries when there are errors. As a result, the steps execute in order and as expected. Step Functions also logs the state of each step, so that any problems can be diagnosed and debugged quickly. This all parallels well with the steps associated with data pipelines.
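To make the idea concrete, here is a minimal sketch of what such a workflow might look like, written as a Python dictionary in Amazon States Language (the JSON format Step Functions uses for state machine definitions). The three step names, the retry settings, and the Lambda ARNs are hypothetical placeholders for illustration, not details of the customer’s actual pipeline.

```python
import json


def build_pipeline_definition(extract_arn, transform_arn, load_arn):
    """Return an Amazon States Language definition for a three-step pipeline.

    The step names and retry values are illustrative assumptions.
    """
    # Retry a failed task up to 3 times, waiting 5s, 10s, then 20s.
    retry = [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 5,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    }]

    def task(resource, next_state=None):
        state = {"Type": "Task", "Resource": resource, "Retry": retry}
        if next_state is None:
            state["End"] = True          # last step of the workflow
        else:
            state["Next"] = next_state   # output feeds the next step
        return state

    return {
        "Comment": "Hypothetical extract-transform-load pipeline",
        "StartAt": "Extract",
        "States": {
            "Extract": task(extract_arn, "Transform"),
            "Transform": task(transform_arn, "Load"),
            "Load": task(load_arn),
        },
    }


def step_order(definition):
    """Follow the Next pointers from StartAt to list steps in execution order."""
    order = []
    name = definition["StartAt"]
    while name is not None:
        order.append(name)
        name = definition["States"][name].get("Next")
    return order


if __name__ == "__main__":
    # Placeholder ARNs; real ones would point at deployed Lambda functions.
    base = "arn:aws:lambda:us-east-1:123456789012:function:"
    definition = build_pipeline_definition(
        base + "extract", base + "transform", base + "load"
    )
    print(json.dumps(definition, indent=2))
    print(step_order(definition))
```

The printed JSON is the kind of document that could be supplied as a state machine definition. Keeping the retry policy in one place means every step recovers from transient failures the same way.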

The Rest of the Solution

Amazon S3 was selected as the primary storage platform for the solution’s associated data lake because of its virtually unlimited scalability. Storage capacity grows seamlessly and without disruption, with the customer only paying for what is used. It’s designed to provide 99.999999999% durability and has native encryption and access control capabilities. All data types can be stored in their native formats. It also integrates with services such as Amazon Athena and AWS Glue to query and process data, as well as with AWS Lambda serverless computing to run code without provisioning or managing servers.

The solution also uses Amazon Athena, a serverless, interactive query service that makes it easy to analyze data in Amazon S3 using well-known SQL. The customer only pays for the amount of scanned data. Plus, there’s no need for complex ETL jobs to prepare data for analysis.
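As a rough illustration of querying S3 data with SQL, the sketch below builds the request for the boto3 Athena client’s `start_query_execution` call. The database name, table name, column names, and results bucket are all hypothetical; the article does not describe the customer’s schema.

```python
def build_athena_request(sql, database, output_location):
    """Build keyword arguments for the Athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query and return its execution ID.

    Requires boto3 and AWS credentials, so it is not called by default.
    """
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("athena")
    return client.start_query_execution(**request)["QueryExecutionId"]


if __name__ == "__main__":
    # Hypothetical table and columns, for illustration only.
    request = build_athena_request(
        sql="SELECT event_type, COUNT(*) AS events "
            "FROM clickstream GROUP BY event_type",
        database="analytics_lake",
        output_location="s3://example-athena-results/queries/",
    )
    print(request)
```

Because pricing is based on the data scanned, selecting only the columns a report needs, rather than `SELECT *`, can help keep query costs down when the data is stored in a columnar format.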

Amazon Kinesis is employed to collect, process, and analyze real-time, streaming data at any scale, including video, audio, application logs, website clickstreams, and other telemetry data. It allows for processing and analyzing data as it arrives instead of waiting until all data is collected.
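For a sense of how an application might feed events into a stream as they occur, here is a small sketch that prepares a record for the boto3 Kinesis client’s `put_record` call. The stream name and the event fields are assumptions made up for this example.

```python
import json


def build_kinesis_record(stream_name, event, partition_field):
    """Build keyword arguments for the Kinesis put_record call."""
    return {
        "StreamName": stream_name,
        # Kinesis carries opaque bytes, so serialize the event as JSON.
        "Data": json.dumps(event, sort_keys=True).encode("utf-8"),
        # Records with the same partition key land on the same shard.
        "PartitionKey": str(event[partition_field]),
    }


def send(record):
    """Send one record and return its sequence number.

    Requires boto3 and AWS credentials, so it is not called by default.
    """
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("kinesis").put_record(**record)["SequenceNumber"]


if __name__ == "__main__":
    # Hypothetical clickstream event.
    record = build_kinesis_record(
        "clickstream", {"user_id": 42, "page": "/home"}, "user_id"
    )
    print(record)
```

Partitioning by a field such as a user ID is one common choice, since it keeps each user’s events in order within a shard.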

Amazon Aurora is used for the database. According to AWS, it’s up to five times faster than standard MySQL databases and provides the security, availability, and reliability of commercial databases at 1/10th the cost. It’s fully managed by Amazon RDS, which automates tasks such as database setup, patching, and backups. It delivers high performance and availability with up to 15 low-latency read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across three Availability Zones (AZs).

In addition, Aurora Auto Scaling dynamically adjusts the number of Aurora Replicas provisioned for an Aurora DB cluster that uses single-master replication. This enables the Aurora DB cluster to handle sudden workload increases. When the workload decreases, unneeded replicas are removed, so the customer doesn’t have to pay for unused provisioned DB instances.
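Replica scaling of this kind is configured through Application Auto Scaling. The sketch below builds the two requests involved: one that registers the cluster’s replica count as a scalable target, and one that attaches a target-tracking policy. The cluster name, replica limits, 60% CPU target, and cooldown values are assumptions chosen for illustration, not the customer’s settings.

```python
MAX_AURORA_REPLICAS = 15  # Aurora supports up to 15 read replicas


def build_replica_scaling_requests(cluster_id, min_replicas, max_replicas,
                                   target_cpu_percent):
    """Build requests for register_scalable_target and put_scaling_policy."""
    if not 0 <= min_replicas <= max_replicas <= MAX_AURORA_REPLICAS:
        raise ValueError("replica limits must satisfy 0 <= min <= max <= 15")

    # Identifies the replica count of one Aurora cluster as the thing to scale.
    target = {
        "ServiceNamespace": "rds",
        "ResourceId": f"cluster:{cluster_id}",
        "ScalableDimension": "rds:cluster:ReadReplicaCount",
    }
    register = dict(target, MinCapacity=min_replicas, MaxCapacity=max_replicas)
    policy = dict(
        target,
        PolicyName=f"{cluster_id}-reader-cpu",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            # Add replicas above the target CPU, remove them below it.
            "TargetValue": target_cpu_percent,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "RDSReaderAverageCPUUtilization",
            },
            "ScaleInCooldown": 300,   # seconds to wait before removing more
            "ScaleOutCooldown": 300,  # seconds to wait before adding more
        },
    )
    return register, policy


def apply_scaling(register, policy):
    """Apply both requests. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**register)
    client.put_scaling_policy(**policy)


if __name__ == "__main__":
    register, policy = build_replica_scaling_requests(
        "demo-cluster", 1, 4, 60.0
    )
    print(register)
    print(policy)
```

Setting a low minimum keeps costs down during quiet periods, while the maximum caps spending during spikes.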

The Results

ClearScale’s innovative solution generated serverless analytics and data pipelines for the customer that have reduced its administrative costs for data management. Data processing is faster. The analytics are far more granular and beneficial. The overall process is more secure and reliable.

The solution also has implications for data management in general. ClearScale’s creative approach is spurring discussions on and investigations into the use of AWS Step Functions for building “better data pipelines,” proving once again that ClearScale is truly at the forefront of the Big Data and app development industries.

