Leveraging the Power of AWS Athena for Large Scale Big Data Queries

November 02 2017

The Challenge

As an organization grows, so does the amount of data it stores, whether that data is created in-house or aggregated from multiple sources. From a business operations perspective, gleaning useful information quickly from terabytes of data becomes harder as the data grows or when multiple, complex joins are needed.

A global news organization discovered this several years ago as they attempted to query their existing EC2 data stores, JSON data stored in S3, and data from the AWS Kinesis Firehose streaming service. Their SQL queries were becoming more and more complex, and the time to get results back was increasing with that complexity, due in no small part to the need to scale clusters to accommodate the more expensive queries.

They asked ClearScale, an AWS Certified Premier Partner, to find ways of optimizing their queries that would allow them to dive deep into their Big Data repositories but without the associated delay in reporting the results. They knew that ClearScale’s expertise in the AWS services ecosystem would likely find solutions to this common, yet complex problem.

Performance and cost topped the client’s list of concerns. The more complex the query, the costlier the resources needed to run it and the longer the wait for results.

The ClearScale Solution

Once ClearScale had evaluated the data schema and the queries the client was planning to use, it quickly became apparent that the solution to the client’s issues was AWS Athena, a serverless query service that lets customers query large data sources without managing servers or data warehouses. Athena is based on ANSI standard SQL and works with standard formats such as CSV, JSON, ORC, Avro and Parquet. A user points Athena at specific data sets in Amazon S3, defines the schema they would like to use, and then executes queries with Athena’s built-in query editor.
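To make the "point at S3 and define a schema" step concrete, here is a minimal sketch in Python that assembles the kind of CREATE EXTERNAL TABLE statement one might paste into Athena's query editor for JSON data. The database, table, column names and bucket path are hypothetical placeholders, not the client's actual schema, and the OpenX JSON SerDe is just one common choice for JSON input.

```python
def build_create_table_ddl(database, table, columns, location, partition_columns=None):
    """Return a CREATE EXTERNAL TABLE statement for JSON files stored in S3.

    columns and partition_columns are lists of (name, hive_type) tuples.
    """
    def render(cols):
        # One backtick-quoted column definition per line.
        return ",\n".join(f"  `{name}` {dtype}" for name, dtype in cols)

    parts = [
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (",
        render(columns),
        ")",
    ]
    if partition_columns:
        parts += ["PARTITIONED BY (", render(partition_columns), ")"]
    parts += [
        # Assumption: the data is JSON, read with the OpenX JSON SerDe.
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'",
        f"LOCATION '{location}'",
    ]
    return "\n".join(parts)


# Hypothetical names used purely for illustration.
ddl = build_create_table_ddl(
    database="news_analytics",
    table="article_events",
    columns=[("article_id", "string"), ("event_type", "string"), ("event_time", "string")],
    location="s3://example-central-bucket/events/",
    partition_columns=[("dt", "string")],
)
print(ddl)
```

The table is only metadata: the statement tells Athena where the files live and how to read them, and the data itself stays in S3.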

Aggregating the data the customer held in EC2, S3 and the AWS Kinesis Firehose streaming service into a centralized S3 bucket would allow Athena to query all of it in one place. Because Athena runs queries in parallel, it can return results quickly, usually within seconds, even for large data sets and queries with many complex joins.
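As an illustration of how a centralized bucket might be organized, the sketch below builds object keys using the Hive-style `key=value` prefix convention, which Athena can map to table partitions. The prefix, source labels and file name are assumptions made for this example; the layout the client actually used is not described here.

```python
from datetime import datetime, timezone


def partitioned_key(source, event_time, filename, prefix="events"):
    """Build an S3 object key with Hive-style partition folders.

    Keeping each source and day under its own prefix lets a query that
    filters on those values read only the matching objects.
    """
    return (
        f"{prefix}/source={source}"
        f"/year={event_time:%Y}/month={event_time:%m}/day={event_time:%d}"
        f"/{filename}"
    )


# Hypothetical source label and file name.
key = partitioned_key(
    "firehose",
    datetime(2017, 11, 2, tzinfo=timezone.utc),
    "batch-0001.json.gz",
)
print(key)  # events/source=firehose/year=2017/month=11/day=02/batch-0001.json.gz
```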

Moreover, because of how Amazon has chosen to price Athena, customers pay only for the queries they run, based on the amount of data each query scans. AWS states that customers can save between 30% and 90% of their per-query costs by compressing, partitioning and converting data into columnar formats, which reduces the data scanned per query and also allows for faster queries over larger data sets.
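The arithmetic behind scan-based pricing can be sketched as follows. The default price of $5 per terabyte scanned is an assumption for illustration (check current AWS pricing for your region), and the data volumes and the 10% scan ratio are hypothetical, not figures from this engagement.

```python
TB = 1024 ** 4  # bytes in one terabyte (binary)


def estimate_query_cost(bytes_scanned, price_per_tb_usd=5.0):
    """Estimate the cost of one query from the bytes it scans.

    price_per_tb_usd is an assumed list price, not a quoted rate.
    """
    return bytes_scanned / TB * price_per_tb_usd


# Hypothetical: a query that scans 2 TB of raw JSON.
raw_json_bytes = 2 * TB
# Hypothetical: after compression, partitioning and conversion to a
# columnar format, the same query reads only 10% as many bytes.
optimized_bytes = raw_json_bytes * 0.10

raw_cost = estimate_query_cost(raw_json_bytes)
optimized_cost = estimate_query_cost(optimized_bytes)
savings_pct = (1 - optimized_cost / raw_cost) * 100

print(f"raw: ${raw_cost:.2f}, optimized: ${optimized_cost:.2f}, saved: {savings_pct:.0f}%")
```

Because cost tracks bytes scanned rather than query runtime or cluster size, any change that lets a query skip data lowers the bill directly.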

The result for our client was apparent. Prior to engaging with ClearScale, the customer spent hours querying the data before they could analyze the results, including time spent beforehand working with their infrastructure team to perform extract, transform and load (ETL) operations on the data. With Athena in place, the same complex SQL queries took seconds to run against millions of objects. This dramatic improvement in query performance allowed the client to spend more time analyzing the results to discover trends and valuable information.

The Benefits

The AWS Athena implementation was the ideal solution for a number of reasons. Not only did it perform better than their prior operational model, but because Athena uses ANSI SQL, it was a perfect fit for the client’s data science team since they were already very familiar with ANSI SQL queries.

Moreover, because the AWS service is serverless, the scalability issues caused by server constraints, which had required the infrastructure team’s involvement before the Athena solution, went away. Finally, because the client paid only per query, based on how much data each query actually scanned, and no longer had to maintain their own server environments, they were able to realize significant operational cost savings.

ClearScale continues to work closely with this global news organization as they begin to fully realize the potential of the AWS Athena solution. As the client becomes more familiar with the query power available to them, ClearScale will be there to help usher in additional refinements to their workflows and feature requests, all aimed at getting valuable results out of increasingly complex queries. For ClearScale, success is not defined as delivering a completed project to a client, but by making sure that our clients are successful now and into the future.
