Data Ingestion Pipeline for Big Data Aggregation and Analysis

March 11, 2019

The Challenge

A financial analytics company's data analysis application had proved highly successful, but that success created a problem of its own. With a growing number of isolated data centers generating constant data streams, it was increasingly difficult to gather, store, and analyze all that data efficiently. The company knew a cloud-based Big Data analytics infrastructure would help, specifically a data ingestion pipeline that could aggregate data streams from individual data centers into central cloud-based storage.

One of the challenges in implementing a data pipeline is determining which design will best meet a company’s specific needs. Data pipeline architectures can be complicated, and there are many ways to develop and deploy them, each with its own advantages and disadvantages.

The company requested ClearScale to develop a proof-of-concept (PoC) for an optimal data ingestion pipeline. The solution would be built using Amazon Web Services (AWS). In addition, ClearScale was asked to develop a plan for testing and evaluating the PoC for performance and correctness.

The ClearScale Solution

ClearScale kicked off the project by reviewing its client’s business requirements, the overall design considerations, the project objectives and AWS best practices.

In addition to the desired functionality, the prototype had to satisfy the needs of various users. These included analysts running ad-hoc queries on raw or aggregated data in cloud storage; operations engineers monitoring the state of the ingestion pipeline and troubleshooting issues; and operations managers adding upstream data centers to, or removing them from, the pipeline configuration.

To make the best use of AWS and meet the client’s specific application needs, it was determined that the PoC would consist of the following:

• Data center-local clusters to aggregate data from the local data center into one location

• A stream of data from the data center-local clusters into Amazon S3

• Amazon S3-based storage for raw and aggregated data

• An Extract, Transform, Load (ETL) pipeline: a continuously running AWS Glue job that consumes data and stores it in cloud storage

• An interactive system for running ad-hoc queries on cloud storage

[Figure: Data pipeline diagram]

However, the way the analytics application works, gathering data from constant streams across multiple isolated data centers, presented issues that still had to be addressed. Among them:

• Event time vs. processing time: SQL clients must be able to filter events efficiently by event creation time (the moment the event was triggered) rather than by event processing time (the moment the ETL pipeline processed it).

• Backdated and lagging events: Events from one data center can lag behind events produced by other data centers for a variety of reasons.

• Duplicate events: After failures or network outages, the ETL pipeline must de-duplicate the event stream so that SQL clients do not see duplicate entries in cloud storage.

• Event latency: The target is one-minute latency between an event being read from the on-premises cluster and being available for queries in cloud storage.

• Efficient queries and small files: Cloud storage doesn’t support appending data to existing files. Meeting the one-minute latency target means data has to be written as small files, each corresponding to a one-minute interval, so the number of files can become extremely large.
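The first three issues can be illustrated with a small sketch. This is not the client's implementation: the Event fields, the event_id de-duplication key, and the minute-level partition key are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    event_id: str              # hypothetical unique key assigned at the source
    event_time: datetime       # when the event was triggered
    processing_time: datetime  # when the ETL pipeline handled it
    origin: str                # hypothetical data center label


def dedupe(events):
    """Keep the first occurrence of each event_id, preserving arrival order."""
    seen = set()
    unique = []
    for ev in events:
        if ev.event_id not in seen:
            seen.add(ev.event_id)
            unique.append(ev)
    return unique


def event_minute(ev):
    """Partition key derived from event time, truncated to the minute."""
    return ev.event_time.strftime("%Y-%m-%dT%H:%M")


def utc(hour, minute, second=0):
    return datetime(2019, 3, 11, hour, minute, second, tzinfo=timezone.utc)


arrivals = [
    Event("a1", utc(10, 0, 5), utc(10, 0, 40), "dc-east"),
    Event("a1", utc(10, 0, 5), utc(10, 1, 10), "dc-east"),   # re-sent duplicate
    Event("b7", utc(9, 58, 30), utc(10, 1, 15), "dc-west"),  # lagging event
]

unique = dedupe(arrivals)
keys = [event_minute(ev) for ev in unique]
```

Keying on event time puts the lagging event from dc-west in the 09:58 bucket even though it was processed after 10:01. This is why a query filtered by event time cannot rely on processing-time partitions alone.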

ClearScale overcame these issues by outlining the following workflow for the ETL process:

• _____ ingests streams from the data center to the cloud, tolerating duplicate and out-of-order events.

• An AWS Glue job writes event data to raw intermediate storage partitioned by processing time, ensuring exactly-once semantics for the delivered events.

• A periodic job fetches unprocessed partitions from the staging area and merges them into the processed area.

• After the data is written, the job updates the Glue Data Catalog to make the new/updated partitions available to the clients.
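The periodic merge step in this workflow might look roughly like the following sketch. It uses in-memory dictionaries in place of S3 and the Glue Data Catalog. The record fields, the hourly event-time partitioning, and the merge_staged helper are hypothetical and are not taken from the PoC.

```python
from datetime import datetime, timezone


def event_hour(record):
    """Event-time partition key; hourly granularity is an assumption."""
    return record["event_time"].strftime("%Y-%m-%d-%H")


def merge_staged(staging, processed, done):
    """Merge unprocessed staging partitions into the processed area.

    staging:   {processing_time_key: [record, ...]}
    processed: {event_hour_key: {event_id: record}}, mutated in place
    done:      set of staging keys already merged, mutated in place
    Returns the sorted event-time partitions that were created or updated.
    """
    touched = set()
    for key in sorted(staging):
        if key in done:
            continue  # this staging partition was merged by an earlier run
        for record in staging[key]:
            part = event_hour(record)
            bucket = processed.setdefault(part, {})
            if record["event_id"] not in bucket:
                bucket[record["event_id"]] = record
                touched.add(part)
        done.add(key)
    return sorted(touched)


def at(hour, minute):
    return datetime(2019, 3, 11, hour, minute, tzinfo=timezone.utc)


staging = {
    "2019-03-11T10:01": [
        {"event_id": "a1", "event_time": at(10, 0)},
        {"event_id": "b7", "event_time": at(9, 58)},
    ],
    "2019-03-11T10:02": [{"event_id": "c3", "event_time": at(10, 1)}],
}
processed, done = {}, set()
first_run = merge_staged(staging, processed, done)
second_run = merge_staged(staging, processed, done)
```

In a real job, the partitions returned by each run would be the ones to register or refresh in the catalog. Tracking merged staging keys makes a repeated run a no-op, so the job can be retried safely.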

Architecting a PoC data pipeline is one thing; ensuring it meets its stated goals — and actually works — is another. To ensure both, ClearScale also developed, executed, and documented a testing plan.

The testing methodology employs three parts. The PoC pipeline uses the original architecture but with synthetic consumers instead of ETL consumers. The test driver simulates a remote data center by running a load generator.
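A load generator for such a test driver could be sketched as follows. The event shape, the duplicate rate, and the lag range are illustrative assumptions, not values from ClearScale's test plan.

```python
import random
from datetime import datetime, timedelta, timezone


def generate_events(count, start, duplicate_rate=0.1, max_lag_seconds=120, seed=0):
    """Yield synthetic events, some re-sent as duplicates and some backdated."""
    rng = random.Random(seed)  # seeded so that test runs are reproducible
    for i in range(count):
        sent_time = start + timedelta(seconds=i)
        lag = rng.randint(0, max_lag_seconds)
        event = {
            "event_id": f"evt-{i}",
            "event_time": sent_time - timedelta(seconds=lag),
            "sent_time": sent_time,
        }
        yield event
        if rng.random() < duplicate_rate:
            yield dict(event)  # simulated re-send after a failure


start = datetime(2019, 3, 11, 10, 0, tzinfo=timezone.utc)
events = list(generate_events(100, start))
unique_ids = {ev["event_id"] for ev in events}
```

Because the generator knows exactly which events it sent, a test can compare what reaches cloud storage against unique_ids to check correctness, and compare arrival times against sent_time to measure latency.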

With test objectives, metrics, setup, and results evaluation clearly documented, ClearScale was able to conduct the required tests, evaluate the results, and work with the client to determine next steps.

The Benefits

ClearScale’s PoC for a data ingestion pipeline has helped the client build a powerful business case for moving forward with building out a new data analytics infrastructure. Best practices have been implemented. Potential issues have been identified and corrected. Enhancements can continue to be made.

Once up and running, the data ingestion pipeline will simplify and speed up data aggregation from constant data streams generated by an ever-growing number of data centers. Data will be stored in secure, centralized cloud storage where it can more easily be analyzed. As a result, the client will be able to enhance service delivery and boost customer satisfaction.

From proofs of concept to production environments, ClearScale helps companies develop and implement technology solutions to meet their most complex needs. A full range of professional cloud services is available, including architecture design, integration, migration, automation, management, and application development.
