Collecting and Enriching Data by Leveraging Snowplow for Deep Analytics

March 29 2018

The Challenge

Every second of every day, billions of data points are created in our online world. Making sense of all of the data and trying to find a singular piece of information that is useful, let alone actionable, is a daunting task. In the fast-moving world of news and media, being able to identify pertinent information out of all of the noise is not just important but a necessity.

A large company in the global news space discovered this firsthand as it attempted to stand up a beta news website. With the website receiving thousands of raw data points each second, it was clear that the company’s custom-built application was not up to the task of sifting through the raw data, cleansing it, or enriching it in a way that would be useful to consumers of the information.

The company approached ClearScale, an AWS Premier Consulting Partner, asking them to evaluate Snowplow to see if it was a suitable solution and, if so, how they could implement it in a way that would complement their beta news website. ClearScale immediately set about analyzing the client’s current implementation, their requirements, and then Snowplow to see how they could provide the desired outcome.

The ClearScale Solution: Using Snowplow to Collect and Enrich Data

Snowplow is a scalable and flexible application that allows for separate data collectors and data enrichers. Collecting data is relatively simple to set up, but cleansing and enriching the data is more challenging. Depending on the data being enriched and how much demand there is for enrichment, traditional cleansing approaches can be resource-intensive, taking a long time to complete one operation before the next request can start.

With Snowplow, a customer can leverage its native AWS implementation to realize economies of scale. By using Amazon Kinesis, Snowplow can obtain a steady stream of fresh data and analyze it as it comes in. Data is streamed from the beta news website, and the collector analyzes it to determine which data is good and can move on to the enrichment process, and which data is bad and is weeded out. Good data is then streamed through Kinesis to the enrichment process, and the data sets that have been successfully enriched are committed to S3 for the batch data flow and to Kinesis for the real-time data flow. Two separate EMR clusters were created for these workloads, both using Apache Spark and one of them also using the Spark Streaming extension. Outputs are then stored in a Druid data store for use in data analysis tools and reporting, or for access via the web.
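The flow described above can be sketched in a few lines of Python. This is only an illustrative, in-memory model and not Snowplow's or ClearScale's actual code: the `Pipeline` class, the `REQUIRED_FIELDS` schema, and the `is_good`, `enrich`, and `process` functions are all hypothetical names, and plain lists stand in for the Kinesis streams and the S3 bucket.

```python
from dataclasses import dataclass, field

# Hypothetical minimal schema; the real validation rules are not given in the article.
REQUIRED_FIELDS = ("event_id", "page_url", "timestamp")


@dataclass
class Pipeline:
    """In-memory stand-ins for the streams and stores in the architecture."""
    bad: list = field(default_factory=list)            # rejected events
    batch_sink: list = field(default_factory=list)     # stands in for S3 (batch flow)
    realtime_sink: list = field(default_factory=list)  # stands in for Kinesis (real-time flow)


def is_good(event: dict) -> bool:
    """Collector-side check: every required field is present and non-empty."""
    return all(event.get(name) not in (None, "") for name in REQUIRED_FIELDS)


def enrich(event: dict) -> dict:
    """Hypothetical enrichment: derive the site section from the URL path."""
    path = event["page_url"].split("://", 1)[-1].partition("/")[2]
    section = path.split("/", 1)[0] or "home"
    return {**event, "section": section}


def process(events: list, pipeline: Pipeline) -> Pipeline:
    """Route bad events aside, enrich good ones, and write them to both flows."""
    for event in events:
        if not is_good(event):
            pipeline.bad.append(event)
            continue
        enriched = enrich(event)
        pipeline.batch_sink.append(enriched)
        pipeline.realtime_sink.append(enriched)
    return pipeline
```

In this sketch every successfully enriched event is written to both sinks, mirroring the split into a batch flow and a real-time flow. In a real deployment the two sinks would be consumed by separate Spark jobs rather than by Python lists.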

Collector and Enrich Logical Diagram


It is important to note that, for this implementation of Snowplow to succeed, ClearScale needed to be certain that the data collectors and the data enrichers could scale independently of one another. Not only would this accommodate the varying rates and volumes of data that had to be collected and enriched, but it also meant that the client could run fewer collectors and more enrichers at any one point in time, allowing for a quicker turnaround of cleansed data for analysis. ClearScale designed and implemented two separate, independently managed auto-scaling groups so that the client’s cleansed data would be available in near real time.
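To illustrate why independent scaling matters, the following sketch sizes collectors and enrichers from separate load signals. It is a simplified model under stated assumptions, not the auto-scaling configuration ClearScale deployed: the `ScalingPolicy` class, the `plan` function, and all capacity and limit numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Sizing rule for one auto-scaling group (all values are assumptions)."""
    per_instance_capacity: float  # events per second one instance can handle
    min_instances: int
    max_instances: int

    def desired_instances(self, load: float) -> int:
        """Instances needed for the given load, clamped to the group's limits."""
        needed = math.ceil(load / self.per_instance_capacity)
        return max(self.min_instances, min(self.max_instances, needed))


# Collecting is cheap per event, enriching is expensive, so each group
# gets its own capacity figure and its own limits.
collector_policy = ScalingPolicy(per_instance_capacity=5000, min_instances=2, max_instances=10)
enricher_policy = ScalingPolicy(per_instance_capacity=1000, min_instances=2, max_instances=40)


def plan(incoming_rate: float, enrichment_rate: float) -> dict:
    """Size each group from its own signal, independently of the other."""
    return {
        "collectors": collector_policy.desired_instances(incoming_rate),
        "enrichers": enricher_policy.desired_instances(enrichment_rate),
    }
```

With these assumed numbers, `plan(8000, 12000)` returns `{"collectors": 2, "enrichers": 12}`: a small collector tier feeding a much larger enricher tier, which is the "fewer collectors with more enrichers" situation described above.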

By approaching it in this manner, ClearScale also ensured that any new data types or data sets that needed to be analyzed in the future would take minimal effort for the client to set up. It also gave the client more flexibility to create complex enrichment rules without much effort or custom development, and without worrying about degrading the overall performance of the solution and architecture.

The Benefits

With Snowplow, ClearScale was able to deliver a solution that allowed the client to stream in numerous data sets every second and have them properly cleansed and enriched very quickly, thereby allowing for quick analysis. Where once it took minutes or potentially hours to analyze data, the client could now do so in a matter of seconds. This near-real-time feedback loop was possible due in large part to how Snowplow is implemented and how it leverages Amazon Kinesis. With the continued growth of AWS services, the providers of Snowplow foresee continued evolution of the product in the years to come, which customers will benefit from.

ClearScale’s ability to quickly assess the client’s need for Snowplow has given the client the ability to analyze data quickly without worrying about how much data needs to be processed. This in turn gives the client an advantage over the competition in identifying, in near real time, the news and media events that are important to its customer base.

ClearScale’s time-proven results have demonstrated our ability to assess a client’s true needs, design and implement solutions that are robust and scalable, and serve as a true partner in whatever endeavor your organization wants to undertake.

Get in touch today to speak with a cloud expert and discuss how we can help:

Call us at 1-800-591-0442
Send us an email at sales@clearscale.com
Fill out a Contact Form
Read our Customer Case Studies
