
Using AWS Batch to Analyze and Extract Information from Large Document Data Stores

September 12, 2018

Data stores have grown significantly over the last decade, especially with the introduction of IoT-managed devices, such as medical devices.

From a business perspective, running reporting and analysis against an ever-growing data store is a problem for some technologies, because transformation routines take longer as the volume of information increases. With large volumes of data, it can be challenging to find an easy way to apply transformations and enrichment to data objects.

One client of ClearScale, an AWS Premier Consulting Partner, faced this problem and asked ClearScale to determine the best approach. The client needed a data store that could hold large volumes of structured and unstructured data and that would let them search and manage the stored documents. Searches had to work either on a document's metadata attributes, such as date created or author information, or on the contents of the document.

The Challenge

On the surface, the storage portion of the solution was the easy decision. ClearScale determined that AWS S3 buckets for storage, combined with AWS ElasticSearch Service for indexing, would provide the fundamental components of the final solution. Once these were established, ClearScale could set up an Extract-Transform-Load (ETL) pipeline to ingest data from the S3 bucket directly into ElasticSearch for indexing.
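To make the load step of such a pipeline concrete, here is a minimal sketch of how document records could be serialized into the newline-delimited body that the Elasticsearch `_bulk` API accepts. The index name, the field names (`author`, `created`, `content`) and the `id` key are assumptions for illustration and do not come from the client's project. Depending on the Elasticsearch version, the action line may also need a `_type` entry.

```python
import json
from typing import Dict, Iterable


def to_bulk_payload(index_name: str, docs: Iterable[Dict]) -> str:
    """Serialize documents into a newline-delimited Elasticsearch _bulk body."""
    lines = []
    for doc in docs:
        # Action line: names the target index and the document ID.
        action = {"index": {"_index": index_name, "_id": doc["id"]}}
        # Source line: the searchable fields (metadata plus extracted text).
        source = {key: value for key, value in doc.items() if key != "id"}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # Every line of a bulk body, including the last one, ends with a newline.
    return "".join(line + "\n" for line in lines)
```

The resulting string would be sent as the body of a `POST` to the `_bulk` endpoint. The HTTP call itself is left out here because it depends on how the domain is secured.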

However, while in the ETL pipeline, the data would need to be augmented, transformed and enhanced so that ElasticSearch indexing could do its job, and this was the challenge that ClearScale faced. Normally, AWS Lambda could perform these actions, but given the volume of data and documents that needed to be modified in the ETL pipeline, it was apparent that Lambda would not be able to keep pace with the task. ClearScale needed a managed compute engine that would work for longer-running tasks.

The Solution

ClearScale determined that one of the optimal ways to solve this issue was to leverage AWS Batch. Using Batch, ClearScale was able to extract text from large documents, often on the order of hundreds of megabytes of data, and then split them into chunks for easier indexing. Batch also allowed ClearScale to enumerate large S3 buckets with hundreds of thousands of documents when doing bulk import operations.
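The chunking step can be illustrated with a short sketch. This is not ClearScale's implementation: the function name, the default chunk size and the overlap between neighboring chunks are all assumptions, chosen only to show how extracted text might be cut into index-sized pieces.

```python
from typing import Iterator


def split_into_chunks(text: str, max_chars: int = 10_000, overlap: int = 200) -> Iterator[str]:
    """Yield consecutive slices of text, each at most max_chars long.

    Neighboring chunks share `overlap` characters so that a phrase cut at a
    chunk boundary still appears whole in one of the two chunks.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    step = max_chars - overlap
    start = 0
    while start < len(text):
        yield text[start:start + max_chars]
        if start + max_chars >= len(text):
            break  # this chunk already reached the end of the text
        start += step
```

Because the function is a generator, a job can index each chunk as it is produced instead of holding every chunk of a very large document in memory at once.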

Batch was determined to be the best solution for a variety of reasons beyond these two critical functional aspects. AWS Batch automatically scales based on the volume of incoming tasks in the queue. It also provides extensive monitoring and control features to manage batch processing and, ultimately, is agnostic of what it is processing, so long as the job is packaged in a Docker container image.
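As a rough sketch of how a bulk import could feed a Batch queue, the code below lists the keys in a bucket with the boto3 `list_objects_v2` paginator and submits one job per key with `submit_job`. The clients are passed in as arguments, so they would be created elsewhere with `boto3.client("s3")` and `boto3.client("batch")`. The job name pattern and the `SOURCE_BUCKET` and `SOURCE_KEY` environment variables are hypothetical, as is the assumption that the container reads them to find its document.

```python
from typing import Any, Iterator, List


def iter_object_keys(s3_client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under a prefix, one page of results at a time."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is missing from a page when no keys match.
        for obj in page.get("Contents", []):
            yield obj["Key"]


def submit_extraction_jobs(
    batch_client: Any,
    s3_client: Any,
    bucket: str,
    job_queue: str,
    job_definition: str,
    prefix: str = "",
) -> List[str]:
    """Submit one Batch job per S3 object and return the job IDs."""
    job_ids = []
    for number, key in enumerate(iter_object_keys(s3_client, bucket, prefix)):
        response = batch_client.submit_job(
            jobName=f"extract-{number}",  # hypothetical naming scheme
            jobQueue=job_queue,
            jobDefinition=job_definition,
            containerOverrides={
                # Hypothetical variables the container would read at startup.
                "environment": [
                    {"name": "SOURCE_BUCKET", "value": bucket},
                    {"name": "SOURCE_KEY", "value": key},
                ]
            },
        )
        job_ids.append(response["jobId"])
    return job_ids
```

One job per document keeps the sketch simple. For buckets with hundreds of thousands of objects, grouping several keys into each job may be a better fit.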


The Benefit

Ultimately, the solution that ClearScale designed and implemented for the client was viewed as a complete success. It gave the client just what they needed: the ability to support ingestion of large documents and groups of documents by utilizing S3 buckets, and the ability to handle rate limits imposed by AWS ElasticSearch indexing by leveraging AWS Batch for processing and transforming the data.
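One common way for a processing job to respect indexing rate limits is to retry with exponential backoff when the service answers with HTTP 429 (Too Many Requests). The sketch below illustrates that general technique and is not necessarily what was built for the client. The `send` callable is a hypothetical stand-in for whatever function posts a bulk request and returns the HTTP status code.

```python
import random
import time
from typing import Callable


def send_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a status other than 429 or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = 429
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            # Double the wait on every retry and add random jitter so that
            # parallel jobs do not all retry at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    return status
```

The caller still has to decide what to do when the final status is 429, for example failing the job so that it can be retried later.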

ClearScale believes that success for any client engagement goes beyond just delivering a solution. The fundamental challenge in any project is understanding what the client ultimately needs, and ClearScale dedicates substantial resources to understanding every nuance of a client request. The result often exceeds the client's expectations and allows them to leverage the solution for their business operations in ways they never thought possible.

Get in touch today to speak with a Cloud expert and discuss how we can help:

Call us at 1-800-591-0442
Send us an email: sales@clearscale.net
Fill out a Contact Form
Read our Customer Case Studies

© 2017 ClearScale, LLC. All Rights Reserved.