success story

Powering data analytics at scale through Amazon Elastic MapReduce (EMR)

Actionable insights derived from big data are an integral part of our client’s business operations. As a global leader in providing insights and analytics, it delivers critical data, information, workflow solutions, and deep domain expertise to accelerate innovation. 

The Challenge

The incumbent Qubole-based setup limited the client’s ability to scale its data analytics capabilities to cope with increasing data traffic. The system had become unresponsive and did not give the client enough agility to differentiate itself in a crowded marketplace of data and insights services. In response, the client was compelled to drive down costs to stay competitive.

The client sought to alleviate capacity, flexibility, and agility challenges by implementing a new system. They also needed to optimize costs while not compromising on efficiencies. Other key reasons to adopt Amazon EMR were:

  • The Qubole service cost per node was higher than the Amazon EMR service cost

  • Qubole provided a limited number of free user accounts and charged for additional ones, which increased the overall service charge

  • The client wanted to reduce dependency on multiple vendors and consolidate its big data under one provider

  • Platform stability issues were reported due to frequent version updates

The Solution

Amazon EMR is known for its ability to analyze vast data sets quickly and enhance data management efficiency. Because the client was an existing AWS customer, Virtusa helped replace Qubole with Amazon EMR. Running Amazon EMR would help the client control costs and maximize ROI.

Relevant features of the deployment included:

  • Separate accounts for development and production, and separate Bastion access to every AWS account, including a separate private key to access Amazon EC2 nodes

  • Additionally, Virtusa implemented AWS Identity and Access Management (IAM) role-based access to edge nodes, clusters, and Amazon S3 resources; security groups defined according to project requirements; and Single Sign-On and Multi-Factor Authentication for access to the AWS console

  • The Amazon EMR workflow included automatic cluster launch and termination, driven by scripts in the batch processing workflow

  • Since some of the big data jobs used HDFS for intermediate data storage, we deployed Amazon Relational Database Service (Amazon RDS) for Oracle for database procedures. The final output data was stored in Amazon S3 buckets, with Amazon S3 bucket lifecycle rules applied

  • Automated management and deployment of clusters for the customer were carried out through AWS CLI with shell scripting. Bootstrapping was used to copy certain application-specific files into all data nodes
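
The automated launch-and-terminate workflow, the bootstrapping step, and the S3 lifecycle rules described above can be sketched roughly as follows. The original deployment used the AWS CLI with shell scripting; this illustration instead builds Python dictionaries in the shape that boto3's `run_job_flow` and `put_bucket_lifecycle_configuration` calls accept. Every concrete value here (bucket names, script paths, instance types, release label, role names, retention periods) is a hypothetical placeholder, not a detail of the client's setup.

```python
import json


def build_transient_cluster_request(name, bootstrap_script, step_args, log_uri):
    """Build an EMR cluster request that terminates itself after its steps finish.

    The dict follows the parameter shape of boto3's EMR `run_job_flow`.
    Instance types, counts, release label, and role names are assumptions.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",          # assumed EMR release
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],   # assumed framework
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed sizing
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False makes the cluster shut down once all steps complete,
            # which gives the "automatic termination" behavior.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Bootstrap actions run on every node before applications start,
        # so they can copy application-specific files onto each node.
        "BootstrapActions": [
            {
                "Name": "copy-application-files",
                "ScriptBootstrapAction": {"Path": bootstrap_script, "Args": []},
            }
        ],
        "Steps": [
            {
                "Name": "batch-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {"Jar": "command-runner.jar", "Args": list(step_args)},
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default role names, assumed
        "ServiceRole": "EMR_DefaultRole",
    }


def build_output_lifecycle_rules(prefix, archive_after_days, expire_after_days):
    """Build an S3 lifecycle configuration for final output data.

    Objects under `prefix` move to Glacier, then expire. The retention
    periods are illustrative inputs, not values from the case study.
    """
    if expire_after_days <= archive_after_days:
        raise ValueError("expiration must come after the archive transition")
    return {
        "Rules": [
            {
                "ID": "archive-then-expire-output",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


if __name__ == "__main__":
    request = build_transient_cluster_request(
        name="nightly-batch",
        bootstrap_script="s3://example-bucket/bootstrap/copy_app_files.sh",
        step_args=["spark-submit", "s3://example-bucket/jobs/aggregate.py"],
        log_uri="s3://example-bucket/emr-logs/",
    )
    rules = build_output_lifecycle_rules("output/", 30, 365)
    print(json.dumps(request, indent=2))
    print(json.dumps(rules, indent=2))
```

With AWS credentials configured, these dictionaries could be passed as `boto3.client("emr").run_job_flow(**request)` and `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket="example-bucket", LifecycleConfiguration=rules)`. Keeping the request builders separate from the API calls lets the configuration be reviewed and tested without touching an AWS account.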

AWS Services Used:

  • Amazon EMR
  • Amazon EC2
  • Amazon S3
  • AWS IAM
  • Amazon RDS

The Benefit

With Amazon EMR, the client has access to the scale it requires to find meaningful connections with data and deliver insights to its customers.

By drawing on services within the AWS ecosystem, the company no longer has to plan for downtime to deploy enhancements or features, since Amazon EMR is a managed service. Amazon EMR also gave the client better control over costs, as its pay-per-instance pricing is straightforward and predictable. Other key benefits include:

  • Reduced the cost of cloud computing significantly, saving $200,000 annually, while improving platform stability

  • Enhanced security by separating AWS accounts for development and production and by applying role-based access and security groups