
EMR Serverless: a 400-level guide


For those who didn't catch the title: level 400 in AWS terms means "expert", but here it's just a catchy title and nothing more 🙂

AWS EMR is a cloud-based big data platform used by engineers to perform large-scale distributed data processing, interactive SQL queries, and other analytical tasks using open source analytics platforms such as Apache Spark, Apache Hive, and Presto. If you're a data engineer or data analyst, I'm sure you've worked with EMR at some point in your career.

Using EMR, you have full control over cluster configuration. The ability to customize clusters allows you to optimize cost and performance for your specific workload. However, configuring clusters for optimal cost and performance requires in-depth knowledge of the underlying analytical platforms and frameworks. Moreover, the compute and memory resources required for optimal application performance depend on many factors, such as task processing times and the volume of data to be processed, and these characteristics change over time. Many teams don't need that level of control.

EMR Serverless provides a simpler alternative, saving engineers from having to manage these configurations.

With EMR Serverless, you can get all the benefits of working with EMR, but in a serverless environment. You simply choose the framework you want to use for your application and submit jobs using the API. You don't need to worry about configuring anything. EMR Serverless automatically allocates and scales the compute and memory resources needed to perform the task, and users pay only for the resources they use. Currently, the frameworks you can use in EMR Serverless are limited to Apache Spark and Apache Hive.

Different EMR deployments

Let’s dive a little deeper into how it works.

Concepts

EMR Serverless uses several core concepts:

  • Application. EMR Serverless works on the concept of an Application (similar to running an EKS cluster). Once an application is initiated, it can process multiple jobs and dynamically allocates workers based on job needs. You can also set limits to control and track the usage costs incurred by the application. The Application concept exists to maintain separate logical environments for different teams or applications.
  • Job run. A job run is a request submitted to an EMR Serverless application that is asynchronously executed and tracked through completion. EMR Serverless starts executing jobs as soon as they are received and runs multiple job requests concurrently. A submitted job is automatically run in the best availability zone.
  • Workers. An EMR Serverless application internally uses workers to execute workloads. The default size of these workers is based on your application type and release version (there are not many of them at the moment). You can override these sizes when scheduling a job run.
    EMR Serverless automatically scales workers up or down depending on the workload and concurrency required at every stage of the job, removing the need for engineers to estimate the number of workers required to run their workloads. This is actually the killer feature.
  • Pre-initialized workers. EMR Serverless provides an optional feature to pre-initialize workers when your application starts up, so that the workers are ready to process requests immediately when a job is submitted to the application. Pre-initialized workers maintain a warm pool of workers for the application so that it can provide a sub-second response to start processing requests. That’s the documentation’s wording, but unfortunately my experience was different.

EMR Serverless features

EMR Serverless automatically prepares and configures resources as needed, so you don't have to think ahead about infrastructure when data volumes change over time. It automatically calculates and allocates the compute and memory resources needed to process requests and scales them at different stages of processing (think of them as Spark or Tez stages). It scales up or down depending on changing job requirements. For example, a Spark job may require two executor nodes in the first 5 minutes, ten executors in the next 10 minutes, and five executors in the last 20 minutes, depending on the type of data processing. Cool, huh?!

For scaling workers, EMR Serverless uses Spark’s Dynamic Resource Allocation feature, which means you can’t use it for Spark Streaming, as Spark Streaming doesn’t support Dynamic Resource Allocation at the time of writing.
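For context, dynamic allocation in plain Spark is driven by a handful of well-known configuration properties. EMR Serverless manages the equivalents of these internally, so the sketch below is purely illustrative (the script name is a placeholder):

```shell
# Illustrative spark-submit flags for Spark's Dynamic Resource Allocation;
# EMR Serverless manages equivalents of these for you.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  my_job.py
```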

In EMR Serverless, workers are assigned to each job as needed, so each job gets the resources it needs. Moreover, because you only pay for the workers your jobs use, you don't incur costs for redundant resources. You pay for the amount of vCPU, memory, and storage resources consumed by your applications. When you think about it that way, using this service makes it very difficult to over- or under-provision resources.

Also, while using EMR Serverless you don't have to worry about instance selection, cluster size, optimizations, starting a cluster, resizing a cluster, node outages, availability zone failover, or OS upgrades.

Finally, because each job can specify the IAM role that should be used to access AWS resources while the job is running, you don't need to create complex configurations to manage queues and permissions.

As for the price, based on tests and some calculations, I came to the conclusion that for short-term jobs EMR Serverless is cheaper than Glue and EMR (EC2). For long-running jobs, there is a runtime point at which a regular EMR (EC2) cluster becomes more cost-effective than a serverless one.

Glue vs EMR Serverless

You may wonder why we need EMR Serverless when we already have a serverless solution with AWS Glue. I'll give you my two cents about this because I've been wondering the same thing.

Glue

As per AWS documentation, AWS Glue is "Simple, scalable, and serverless data integration". Glue can be used for a variety of things: as a metadata repository, automatic schema discovery, code generation, and running ETL pipelines to prepare data. Glue takes care of provisioning and managing the compute resources needed to run your data pipelines, so you don't need to create and manage the infrastructure yourself.

If we focus only on the processing feature and discard the Glue-specific features (schema discovery, code generation, etc) then EMR Serverless and Glue services look almost identical. One of the key advantages of both services is the ability to run Spark or Hive serverless applications.

What advantage will EMR Serverless have over Glue Spark jobs?

To run Glue, you must either specify MaxCapacity (for Glue version 1.0 or earlier jobs) or a Worker type and the Number of workers (for Glue version 2.0 jobs). Both options assume, first, that there is some understanding of the data and workload per cluster, and second, that the workload during job execution will be uniform, i.e., there will be no over- or under-utilization of the provisioned resources.
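To make that concrete, here is what creating a hypothetical Glue 2.0 job looks like; the job name, role ARN, and script location are placeholders. Note that worker type and count must be decided up front:

```shell
# Capacity for a Glue 2.0 job is fixed at creation time:
# --worker-type and --number-of-workers must be chosen in advance.
aws glue create-job \
  --name example-etl-job \
  --role arn:aws:iam::123456789012:role/ExampleGlueRole \
  --command '{"Name": "glueetl", "ScriptLocation": "s3://example-bucket/scripts/job.py"}' \
  --glue-version "2.0" \
  --worker-type G.1X \
  --number-of-workers 10
```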

EMR Serverless

EMR Serverless is a new deployment option for AWS EMR. With EMR Serverless, you don't need to configure, optimize, protect, or manage clusters to run applications on these platforms. EMR Serverless helps you avoid over- or under-allocation of resources to process jobs at the individual stage level.

EMR Serverless automatically identifies the resources needed by jobs, provisions those resources to run the jobs, and releases them when the jobs are completed. In cases where applications require a response within seconds, such as interactive data analysis, the engineer can pre-initialize the necessary resources during application creation. This provides easy initialization, fast job startup, automatic capacity management, and simple cost control.

Demo

This demo assumes you are using an Administrator-level role in your AWS account.

This tutorial will help you get started with EMR Serverless by deploying a simple Spark application.

Prerequisites

Before you launch an EMR Serverless application, make sure you complete the following tasks:

  1. As of this writing, EMR Serverless is still in preview release. Go here to sign up for the pre-release.
  2. In the preview, you interact with EMR Serverless using the AWS CLI to create applications and run jobs. However, the AWS CLI does not ship with the EMR Serverless service model yet; you can update the service model independently of the CLI. Once you've received confirmation of access, download the latest API model file and update the AWS CLI by running:
aws s3 cp s3://elasticmapreduce/emr-serverless-preview/artifacts/latest/dev/cli/service.json ./service.json
aws configure add-model --service-model file://service.json
  3. As of this writing (May 2022), the preview of Amazon EMR Serverless is available only in the US East (N. Virginia) region, so we must set the default region to us-east-1:
aws configure set region us-east-1
  4. The last thing to do is create an S3 bucket. In this tutorial, you'll use the bucket to store scripts, data, output files, and logs, so it must be created in the same region where EMR Serverless is available (us-east-1). Set an environment variable to your bucket name:
BUCKET_NAME=<bucket name>

Once everything is done, check that you have access to the service and everything is working correctly:

aws emr-serverless list-applications

You should get an empty list of applications.
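On a fresh account, the call and its response should look roughly like this (the exact output shape is an assumption based on standard `list-*` CLI conventions):

```shell
$ aws emr-serverless list-applications
{
    "applications": []
}
```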

Creating IAM policy and role

EMR Serverless uses IAM roles to grant jobs granular permissions to resources at runtime, including our newly created S3 bucket. Multiple job run requests can be submitted to an application, and each job run can use a different execution role to access AWS resources.

To configure a job runtime role, first create a runtime role with a trust policy that allows EMR Serverless to assume it. Then attach the required S3 access policy to this role. For ease of use, we will define the resources as a CloudFormation template and deploy it with sceptre, which simplifies working with CloudFormation templates:

AWSTemplateFormatVersion: "2010-09-09"
Description: This template creates the EMR Serverless job runtime role and policy
Parameters:
  RoleName:
    Type: String
    Description: Name of Role
    Default: sampleJobRuntimeRole
  BucketName:
    Type: String
    Description: The bucket the job will write to.
Resources:
  EMRServerlessExecutionRole:
    Type: "AWS::IAM::Role"
    Properties:
      RoleName: !Ref RoleName
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Sid: "EMRServerlessTrustPolicy"
            Effect: Allow
            Principal:
              Service:
                - emr-serverless.amazonaws.com
            Action: "sts:AssumeRole"

  EMRServerlessExecutionPolicy:
    Type: "AWS::IAM::Policy"
    Properties:
      PolicyName: EMRServerlessExamplePolicy
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Action:
              - "glue:GetDatabase"
              - "glue:GetDatabases"
              - "glue:CreateDatabase"
              - "glue:GetTable"
              - "glue:GetTables"
              - "glue:CreateTable"
              - "glue:DeleteTable"
              - "glue:UpdateTable"
              - "glue:GetUserDefinedFunctions"
              - "glue:GetPartition"
              - "glue:GetPartitions"
              - "glue:CreatePartition"
              - "glue:DeletePartition"
              - "glue:BatchCreatePartition"
              - "glue:BatchDeletePartition"
            Resource: "*"
          - Effect: Allow
            Action:
              - "s3:GetObject"
              - "s3:ListBucket"
            Resource:
              - "arn:aws:s3:::*.elasticmapreduce"
              - "arn:aws:s3:::*.elasticmapreduce/*"
          - Effect: Allow
            Action:
              - "s3:PutObject"
              - "s3:GetObject"
              - "s3:ListBucket"
              - "s3:DeleteObject"
            Resource:
              - !Sub "arn:aws:s3:::${BucketName}"
              - !Sub "arn:aws:s3:::${BucketName}/*"
          - Effect: Allow
            Action:
              - sts:AssumeRole
            Resource: !Sub arn:aws:iam::${AWS::AccountId}:role/*
          - Effect: Allow
            Action:
              - iam:PassRole
            Resource:
              - arn:aws:iam::*:role/*
      Roles:
        - !Ref EMRServerlessExecutionRole

Outputs:
  Role:
    Value: !GetAtt EMRServerlessExecutionRole.Arn

After it’s done, we need to create a configuration:

template_path: "emr-serverless.yaml"
parameters:
  RoleName: "sampleJobRuntimeRole"
  BucketName: "<bucket name>"

And finally deployment:

sceptre launch --yes emr-serverless.yaml

Set a environment variable to your role arn:

ROLE_ARN=$(aws iam get-role --role-name sampleJobRuntimeRole --query Role.Arn --output text)

We write everything into environment variables for reusability and simpler usage.

Creating application

EMR Serverless creates workers to accommodate requested jobs. By default, each application uses 3 executors with 4 vCPU, 14 GB of memory, and 21 GB of local storage to run your workloads. Just for reference: EMR Serverless Spark defaults. You have the ability to customize this configuration to control performance of your applications.

You can also specify a pre-initialized capacity by setting the initialCapacity parameter while creating the application. You may also choose to set a limit for the total maximum capacity that an application can use by setting the maximumCapacity parameter.

When a job starts, workers from initialCapacity are used if they are available. If those workers are busy with other jobs, or if the job needs more workers than are available, additional workers are automatically requested and acquired, up to the maximum capacity set for the application.

When jobs finish, the workers they used are released, returning the application to its initialCapacity pool of available resources.

We will be using all of it here:

aws emr-serverless create-application --release-label emr-6.5.0-preview \
    --type "SPARK" \
    --name test-app \
    --initial-capacity '{
        "DRIVER": {
            "workerCount": 1,
            "resourceConfiguration": {
                "cpu": "2vCPU",
                "memory": "4GB"
            }
        },
        "EXECUTOR": {
            "workerCount": 1,
            "resourceConfiguration": {
                "cpu": "2vCPU",
                "memory": "4GB"
            }
        }
      }'  \
    --maximum-capacity '{
        "cpu": "10vCPU",
        "memory": "1024GB"
      }'

Pay attention to the application ID returned in the output, because you will use it to start the application. Let's store it in the environment variable APPLICATION_ID:

APPLICATION_ID=$(aws emr-serverless list-applications --query "applications[0].id" --output text)

To check the state of your application, run the following command:

aws emr-serverless get-application --application-id $APPLICATION_ID

When the application has reached the CREATED state, start it using the following command:

aws emr-serverless start-application --application-id $APPLICATION_ID

Before you can schedule a job, the application must be started. Make sure it has reached the STARTED state using the same get-application API call:

aws emr-serverless get-application --application-id $APPLICATION_ID
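Rather than re-running get-application by hand, you can poll for the state. This is a small sketch; it assumes the response exposes the state under `application.state`, which is what the preview CLI returned for me:

```shell
# Poll every 5 seconds until the application reports STARTED
while true; do
  STATE=$(aws emr-serverless get-application \
      --application-id $APPLICATION_ID \
      --query application.state --output text)
  echo "Application state: $STATE"
  [ "$STATE" = "STARTED" ] && break
  sleep 5
done
```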

Simple batch job

After you create and start your application, you can schedule data processing jobs or interactive requests.

As the simplest example, we will use a simple word-count job to test a batch job:

aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
          "entryPoint": "s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py",
          "entryPointArguments": ["s3://'$BUCKET_NAME'/batch_target"],
          "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
          "s3MonitoringConfiguration": {
            "logUri": "s3://'$BUCKET_NAME'/logs"
           }
        }
    }'

We could leave sparkSubmitParameters empty, without configuring memory or cores, and that would be fine too. For the sake of this tutorial, I wanted to show that you can provide any configuration, just as you can with a normal EMR job.
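For reference, a minimal submission that leaves all sizing decisions to EMR Serverless would look like this (same entry point as above, just with no sparkSubmitParameters):

```shell
# Minimal job run: EMR Serverless picks driver/executor sizing itself
aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
          "entryPoint": "s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py",
          "entryPointArguments": ["s3://'$BUCKET_NAME'/batch_target"]
        }
    }'
```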

Get a job ID:

aws emr-serverless list-job-runs --application-id $APPLICATION_ID
JOB_RUN_ID=$(aws emr-serverless list-job-runs --application-id $APPLICATION_ID --query "jobRuns[0].id" --output text)

Getting the status of the job:

aws emr-serverless get-job-run --application-id $APPLICATION_ID --job-run-id $JOB_RUN_ID

Once finished, the job should be in the SUCCESS state, and the Spark logs will be stored in the provided S3 bucket.
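To peek at what actually landed in the bucket, you can list the log prefix. The `applications/<app-id>/jobs/<job-run-id>/SPARK_DRIVER` layout below is what I observed during testing; treat it as an assumption rather than a documented contract:

```shell
# List everything EMR Serverless wrote under the configured logUri
aws s3 ls s3://$BUCKET_NAME/logs/ --recursive

# Driver stdout is gzip-compressed; stream and decompress it
aws s3 cp s3://$BUCKET_NAME/logs/applications/$APPLICATION_ID/jobs/$JOB_RUN_ID/SPARK_DRIVER/stdout.gz - | gunzip
```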

Tear down everything

Remember to tear everything down when the testing is done:

aws emr-serverless stop-application --application-id $APPLICATION_ID
aws emr-serverless delete-application --application-id $APPLICATION_ID
sceptre delete -y emr-serverless.yaml

Critique

I played around with this service for a while and found several issues along the way.

🧐 There are not many Spark logs I could check. Despite configuring a location for the logs, what lands there are not the real Spark logs: no Spark events, executor logs, and so on. As a result, it is very difficult to debug an application. I was able to run a fairly complex job locally via spark-submit but was not able to do the same on EMR Serverless, and most importantly, I had no tools to understand the problem.

🧐 Scheduling jobs takes time. Even with warm workers, jobs do not execute right away; I was forced to wait some time (2-5 minutes), which seems very slow.

🧐 At the moment you can interact with EMR Serverless only through the AWS CLI. I have no doubt that the service will be added to CloudFormation and CDK, but we are not there yet, unfortunately.

🧐 It's hard for me to judge how well EMR Serverless is doing its job, because I haven't figured out how to monitor these jobs or at least look at the history server logs.

Conclusion

Adopting a serverless architecture can be very efficient for certain business use cases. AWS EMR Serverless is still in preview and will be released soon. I hope it will take large-scale data processing in the cloud to another level.

What I see has a lot of potential, but right now EMR Serverless is still not ready for production deployment, and it's unclear when it will be released.

Check out the AWS talk about EMR Serverless:



