
Ways to Optimize Data Ingestion and Analysis with AWS Glue

August 27, 2018

The wealth of data that resides in an organization’s databases is only as valuable as the tools that extract it and the analysis that turns it into usable knowledge. As companies move their applications and data to the cloud, the ability to aggregate many disparate data sources into a single data store and query that data easily becomes critical to their continued operations.

Consolidating different data sources and schemas into a single location comes with many challenges. Either time must be spent during migration to normalize the data into a common schema, or the tools that ultimately query the combined data must be robust enough to handle the complexities of different data sets without increasing report generation latency.

The Challenge

A recent client of ClearScale, an AWS Premier Consulting Partner, faced exactly this scenario. The organization had amassed an enormous amount of customer transactional data from a number of different sources in a variety of data formats. From store-level transactions to coupon redemptions, app clicks, email clicks and social media, the client had many different views of its customers’ buying behavior.

Despite this wealth of information, the client struggled to distill it into a single view of each customer’s shopping and buying habits. Because the goal was to increase sales by using customer purchasing behavior to drive direct marketing offers, the client needed not only to consolidate the data, but also to analyze it effectively.

The ClearScale Solution

In a traditional data warehouse model, ClearScale would typically have cleaned and normalized the client’s data using Extract-Transform-Load (ETL) processes and then stored it in the warehouse. Unfortunately, this approach can be challenging to set up and maintain with existing ETL ingestion technologies, because the ETL processes have to be constantly adjusted to account for changing datasets and schemas as new data sources become available.

One way around this problem is to skip the normalized ETL ingestion processes and instead let the data live in its natural state, albeit in a centralized location. In this approach, it falls to the analysis tools to normalize the data before returning the expected results. This is not ideal, because normalizing the data at query time can not only increase the latency of generating reports, but also introduce errors into the results when the transformation or normalization step goes wrong.

To avoid this, ClearScale decided a data lake model would be a more viable approach. The model has three components: first, a data store able to hold arbitrarily formatted data; second, a data processing and transformation engine to move the data around and reformat it; and finally, a query engine that works well with semi-structured data during the analysis phase of generating reports. ClearScale determined that a successful implementation would need to rely on AWS Glue, a service that creates the base data schema and provides the ETL functionality to transform the data for easier processing later.


Figure 1: A Typical AWS Glue ETL Model

ClearScale executed this plan by setting up S3 buckets for the data that needed to be stored, then integrating with the client’s data warehouse and data feeds to ingest the data into those buckets. After ingesting a sample set of data, ClearScale used an AWS Glue Data Catalog crawler to create the initial database schema. ClearScale then used Amazon Athena to perform test runs against the schema, fixing schema issues manually until Athena completed a full test run against the Data Catalog without errors.
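
The crawl-then-validate loop described above could be sketched with boto3 roughly as follows. This is a minimal illustration, not ClearScale’s actual setup: the bucket, database, crawler, role and table names are hypothetical placeholders, and the request-building helpers are kept separate from the AWS calls so they can be inspected without credentials.

```python
"""Sketch: crawl sample data in S3, then smoke-test the schema with Athena.

All resource names below are hypothetical placeholders, not project values.
"""

SAMPLE_PATH = "s3://example-data-lake/raw/store-transactions/"  # assumed prefix
DATABASE = "example_customer_lake"                              # assumed database
CRAWLER_ROLE = "arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole"  # assumed
ATHENA_OUTPUT = "s3://example-data-lake/athena-results/"        # assumed location


def crawler_request(name, s3_path, database, role):
    """Build keyword arguments for the Glue CreateCrawler API."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def smoke_test_request(database, table, output_location, limit=10):
    """Build keyword arguments for the Athena StartQueryExecution API."""
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {limit}',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run(table):
    """Create and start the crawler, then submit the test query.

    Requires boto3 and AWS credentials; returns the Athena query execution ID.
    """
    import boto3  # imported lazily so the helpers above work without boto3

    glue = boto3.client("glue")
    athena = boto3.client("athena")

    request = crawler_request(
        "example-sample-crawler", SAMPLE_PATH, DATABASE, CRAWLER_ROLE
    )
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
    # In practice, poll glue.get_crawler(Name=...) until the crawler state is
    # READY again before querying; that loop is omitted here for brevity.

    response = athena.start_query_execution(
        **smoke_test_request(DATABASE, table, ATHENA_OUTPUT)
    )
    return response["QueryExecutionId"]
```

A failed query at this stage would point to a schema problem to fix in the Data Catalog before rerunning, which mirrors the manual fix-and-retest cycle described above.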

Once this effort was completed, ClearScale worked with AWS Glue Development Endpoints to create ETL jobs for each data source. For each source, a dataset was read from the location specified in the AWS Glue Data Catalog, and its schema was mapped from input to output. Where necessary, additional transformations were added until the desired mapping was achieved, and the output dataset was saved to the S3 buckets. Finally, each development endpoint script was converted into an AWS Glue ETL job and tested against the sample data.
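
The input-to-output schema mapping step can be illustrated in plain Python. The sketch below borrows the four-part mapping shape (source field, source type, target field, target type) used by Glue’s ApplyMapping transform, but it does not use the awsglue library, and the field names are hypothetical rather than taken from the client’s data.

```python
"""Sketch: map a source record's schema to a target schema, field by field.

Plain Python for illustration only; this is not the awsglue library.
"""

# Converters for the target types used in this sketch.
CASTS = {"string": str, "int": int, "double": float}

# (source field, source type, target field, target type); hypothetical fields.
COUPON_MAPPINGS = [
    ("cust_id", "string", "customer_id", "string"),
    ("redeemed_amt", "string", "amount", "double"),
    ("store", "string", "store_id", "int"),
]


def apply_mapping(record, mappings):
    """Return a new record with renamed fields cast to their target types.

    In this sketch, fields not named in `mappings` are dropped, and fields
    missing from the record become None so every output row has the same keys.
    """
    output = {}
    for source, _source_type, target, target_type in mappings:
        value = record.get(source)
        output[target] = None if value is None else CASTS[target_type](value)
    return output


if __name__ == "__main__":
    raw = {"cust_id": "C-1001", "redeemed_amt": "2.50", "store": "42", "extra": "x"}
    print(apply_mapping(raw, COUPON_MAPPINGS))
```

Keeping the mapping as data rather than code means each new source only needs its own mapping list, which is one way to keep per-source jobs small.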

Even with this extensive and robust approach to aggregating and transforming data, there were bound to be issues that required optimizing the pipeline. For example, ClearScale encountered a situation where one of the jobs began performing poorly: a 1 GB dataset took more than 12 hours to process on 10 Data Processing Units (DPUs).

To identify the root cause, ClearScale ran the job on an AWS Glue Development Endpoint, which provides built-in Spark debugging tools such as the Spark UI and the Spark History Server. They also went through the logs in Amazon CloudWatch and reviewed the extended metrics to find where the ETL processing was bottlenecked. With these bottlenecks identified, ClearScale was able to address the problem areas in the ETL jobs. The end result was an order-of-magnitude improvement in job run time, with more improvements to come.
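
To put the figures above in perspective, the short calculation below works out the capacity the slow job consumed and what a tenfold speedup would mean. Reading "order of magnitude" as exactly a factor of 10 is an assumption, and since the job ran for more than 12 hours, the "before" numbers are lower bounds.

```python
"""Back-of-the-envelope numbers for the slow job described above."""

DATASET_GB = 1.0      # from the text
RUNTIME_HOURS = 12.0  # from the text ("more than 12 hours", so a lower bound)
DPUS = 10             # from the text
SPEEDUP = 10.0        # assumption: "order of magnitude" read as a factor of 10


def dpu_hours(runtime_hours, dpus):
    """Capacity consumed by a run, in DPU-hours."""
    return runtime_hours * dpus


def throughput_mb_per_dpu_hour(dataset_gb, runtime_hours, dpus):
    """Data processed per DPU-hour, in MB (using 1 GB = 1024 MB)."""
    return dataset_gb * 1024 / dpu_hours(runtime_hours, dpus)


if __name__ == "__main__":
    before = dpu_hours(RUNTIME_HOURS, DPUS)
    after = dpu_hours(RUNTIME_HOURS / SPEEDUP, DPUS)
    rate = throughput_mb_per_dpu_hour(DATASET_GB, RUNTIME_HOURS, DPUS)
    print(f"before: at least {before:.0f} DPU-hours")
    print(f"after a {SPEEDUP:.0f}x speedup: about {after:.0f} DPU-hours")
    print(f"throughput before: at most {rate:.1f} MB per DPU-hour")
```

A throughput of only a few megabytes per DPU-hour is far below what a Spark cluster of that size would normally be expected to deliver, which is why the job was worth profiling rather than simply giving it more DPUs.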

The Benefits

By taking this approach, ClearScale was able to consolidate all the data the client had acquired over time and transform it in a way that allowed the client to query and analyze it, and then market to each customer individually based on that customer’s buying behavior, resulting in increased revenue.

From a technology perspective, implementing AWS Glue within the client’s AWS account provided a stable foundation for future data projects and queries. It also gave the client the opportunity to leverage other AWS services, such as Redshift or Athena, and to overlay them with business intelligence and analytics tools, like Tableau, for data mining and reporting.

ClearScale’s technical acumen, accumulated over many years and numerous client projects, has allowed us to be recognized by Amazon as a Premier Consulting Partner with multiple competencies. Our approach to solving our clients’ needs stems from our ability to recognize and understand underlying operational concerns and solve for those particular issues.

Get in touch today to speak with a Cloud expert and discuss how we can help:

Call us at 1-800-591-0442
Send us an email: sales@clearscale.net

© 2017 ClearScale, LLC. All Rights Reserved.