Apache Spark is considered a powerful complement to Hadoop, the original big data technology. Apache Spark is a more accessible, powerful and capable tool for solving various tasks related to big data. It has become the mainstream and the most demanded framework for big data in all major industries. Since 2.0 Spark has become part of Hadoop. And it is one of the most useful technologies for Python Big Data Engineers.

This series of articles is a single resource that gives an overview of Spark architecture and is useful for people who want to learn how to work with Spark.

Whole series:


What is Spark?

Apache Spark is an open-source cloud computing framework for batch and stream processing which was designed for fast in-memory data processing.

Spark is framework and is mainly used on top of other systems. You can run Spark using its standalone cluster mode on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. I will tell you about the most popular build — Spark with Hadoop Yarn.

So, before we go deeper into Apache Spark, let's take a quick look at the Hadoop platform and what YARN does there.

The OS analogy

To understand what Hadoop is, I will draw an analogy with the operating system. The traditional operating system at a high level consists of several parts: a file system and a processing component.

On one machine there is file system, it can be different: FAT32, HPFS, ext2, NFS, ZFS, etc. The operating system can be different. It is a system for data storage and management.

Classic OS

In addition, the OS has a processor component: a kernel, a scheduler, and some threads and a process that allows programs to run on data.

Hadoop

When we move this storage/processing concept to the cluster level and place it inside Hadoop, we get basically the same separation, the same two components. But the storage layer instead of the single-node file system will be HDFS — Hadoop distributed file system. And YARN (Yet Another Resource Negotiator) takes on the role of the processing component: execution, scheduling, deciding what can be done and where.

Let us take a closer look.

YARN architecture

YARN

So how the YARN works?

YARN has some basic process — ResourceManager. ResourceManager is the ultimate authority that decides the allocation of resources between all applications in the system. It has minions that run on all nodes of the cluster called NodeManager. Also, ResourceManager has a plug-in component scheduler that is responsible for allocating resources for various running applications, taking into account familiar limitations on capacity, queues and other factors.

YARN NodeManagers

All nodes of the cluster have a certain number of containers. Containers are computing units, a kind of wrappers for node resources to perform tasks of a user application. They are the main computing units that are managed by YARN. Containers have their own parameters that can be configured on-demand (e.g. ram, CPU, etc.).

YARN Containers

Containers on each node are controlled by NodeManager daemon. When launching a new application on a cluster, ResourceManager allocates one container for ApplicationMaster. ApplicationMaster for each application is a framework-specific entity that is tasked with negotiating resources with ResourceManager and working with NodeManager(s) to perform and monitor component tasks.

Once launched, ApplicationMaster will be responsible for the entire lifecycle of the distributed application. First of all, it will send resource requests to ResourceManager to get the containers needed to perform the application tasks. A resource request is simply a request for a number of containers that meets some resource requirements, for example:

  • A number of resources, expressed in megabytes of memory and processor shares
  • Preferred container location, indicated by hostname, rackname, or * to indicate no preference
  • Priority within the application, not for multiple applications.

The ApplicationMaster is run in a container like any other application. If ApplicationMaster crashes or becomes unavailable, ResourceManager can create another container and restart ApplicationMaster on it. It makes the whole application highly available. ResourceManager stores information about running applications and tasks performed in HDFS. When restarted, ResourceManager recreates the state of applications and restarts only tasks that have not been completed.

ResourceManager, NodeManager and the container do not care about the type of application or task. All application framework code is simply transferred to the ApplicationMaster so that any distributed framework can be supported by YARN — as long as someone implements a suitable ApplicationMaster for it. With this common approach, the dream of a Hadoop YARN cluster with many various workloads comes true.

Submit application

Equipped with knowledge from previous sections, it will be useful to learn how applications themselves work in YARN.

Let's go through a sequence of steps of the application launch:

  1. A client program submits the application, including the necessary specifications to run the application-specific ApplicationMaster.

  2. ResourceManager gets responsibility for the allocation of a necessary container in which ApplicationMaster will be started. Then ResourceManager starts ApplicationMaster.

  3. ApplicationMaster registers itself in ResourceManager. Registration allows the Customer program to request specific information from ResourceManager that allows it to directly interact with its ApplicationMaster.

  4. In normal operation ApplicationMaster asks for suitable containers from ResourceManager for the application to run.

  5. After successfully receiving the containers, ApplicationMaster launches them, providing NodeManager(s) their configurations.

  6. Inside the containers, it runs the user application code. The NodeManager(s) then provides the information (execution phase, status) for ApplicationMaster.

  7. During the runtime of the user application, the client interacts with ApplicationMaster to obtain the application status.

  8. When the application completes and all necessary work is completed, ApplicationMaster deregisters from ResourceManager and terminates, releasing the container for other purposes.

Interesting facts and features

YARN offers a number of other great features. It is beyond the scope of this post to describe them all, but I have included some noteworthy features:

  • Uberization is the ability to run all MapReduce tasks in ApplicationMaster's JVM if the tasks are small enough. This way, you avoid the overhead associated with requesting containers from ResourceManager and asking NodeManagers to run (presumably small) tasks.
  • Simplified management and access to application log files. Application-generated logs do not remain on individual slave nodes (as in MRv1) but are moved to a central repository, such as HDFS. They can later be used for debugging or for historical analysis to detect performance problems.

Next post

Recommended books

Last updated Thu Apr 16 2020