Apache Spark is considered as a powerful complement to Hadoop, big data’s original technology. Spark is a more accessible, powerful and capable big data tool for tackling various big data challenges. It has become mainstream and most in-demand big data framework across all major industries. Spark has become part of the Hadoop since 2.0. And is one of the most useful technologies for Python Big Data Engineers.
This series of posts is a single-stop resource that gives spark architecture overview and it's good for people looking to learn spark.
What is Spark?
Apache Spark is an open source cloud computing framework for batch and stream processing which was designed for fast in-memory data processing.
Spark is a framework and it mostly used on top of other systems. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. I will tell you about the most popular build — Spark on top of the Hadoop Yarn.
So before we go in depth of what the Apache Spark consists of, let's briefly understand the Hadoop platform and what YARN is doing there.
The OS analogy
In order to understand what Hadoop is, I will make an analogy with the operating system. The traditional operating system on a high-level consists of several parts: the file system and the processing component.
On a single machine, there is a file system, it can be different: FAT32, HPFS, ext2, NFS, ZFS, etc. It is basically a system for storing and managing data.
Also OS have a processing component: kernel, scheduler and some threads and process that's allowing us to run programs on the data.
When we move that concept of storage/processing parts to a cluster level and put that inside Hadoop we basically will get the same separation, the same two components. But the storage layer is instead of a single node file system will be HDFS — Hadoop distributed file system. And YARN(Yet Another Resource Negotiator) takes the role of the processing component: execution, scheduling, deciding what gets to run particular tasks and where.
Let's take a closer look at it.
So how the YARN works?
YARN has some kind of the main process — ResourceManager. The ResourceManager is the ultimate authority that arbitrates the division of resources among all the applications in the system. It has minions that are running on all cluster nodes which called NodeManagers.
All cluster nodes in the cluster have a certain number of containers. Containers are computational units, some kind of wrappers for node resources for performing user application's tasks. They are the main processing units which YARN manages. Containers have their own parameters that can be configured upon request(like ram, CPU, etc).
The containers on each node are controlled by the NodeManager daemon. When a new application is launching on a cluster, the ResourceManager allocates one container for the ApplicationMaster.
The per-application ApplicationMaster is a framework-specific entity and is tasked with negotiating for resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.
The ResourceManager has a pluggable scheduler component, which is responsible for allocating resources to the various running applications subject to the familiar constraints of capacity, queues, and other factors.
After the ApplicationMaster is started, it will be responsible for the whole life cycle of the distributed application. First and foremost, it will be sending resource requests to the ResourceManager to ask for containers needed to run an application’s tasks. A resource request is simply a request for a number of containers that satisfies some resource requirements, such as:
- An amount of resources, today expressed as megabytes of memory and CPU shares
- A preferred location, specified by hostname, rackname, or
*to indicate no preference
- A priority within this application, and not across multiple applications
If the ApplicationMaster crashes or becomes unavailable, the ResourceManager can create another container and restart the ApplicationMaster on it.
The ResourceManager stores information about running applications and completed tasks in HDFS. If the ResourceManager is restarted, it recreates the state of applications and re-runs only incomplete tasks.
The ResourceManager, the NodeManager, and a container are not concerned about the type of application or task. All application framework-specific code is simply moved to its ApplicationMaster so that any distributed framework can be supported by YARN — as long as someone implements an appropriate ApplicationMaster for it. Thanks to this generic approach, the dream of a Hadoop YARN cluster running many various workloads comes true.
Armed with the knowledge from the previous sections, it will be useful to find out how the applications themselves work in YARN.
Let's go through the sequence of steps of application start:
A client program submits the application, including the necessary specifications to launch the application-specific ApplicationMaster itself.
The ResourceManager gets responsibility for allocating the required container in which the ApplicationMaster will starts. Then ResourceManager launches the ApplicationMaster.
The ApplicationMaster registers itself on ResourceManager. Registration allows the client program to ask the ResourceManager for specific information which allows it to directly communicate with its own ApplicationMaster.
During normal operation, the ApplicationMaster requests the appropriate resource containers.
Upon successful receipt of the containers, the ApplicationMaster launches the container by providing the NodeManager with a container configuration.
Inside the container, the user application code starts. Then the user application provides information (stage of execution, status) to ApplicationMaster.
During the user application execution, the client communicates with the ApplicationMaster to obtain the status of the application.
When the application has completed execution and all the necessary work has been finished, the ApplicationMaster deregisters from ResourceManager and shuts down, freeing its container for other purposes.
Interesting facts and features
YARN offers several other great features. Describing all of them is outside the scope of the post, but I’ve included some noteworthy features:
- Uberization is the possibility to run all tasks of a MapReduce job in the ApplicationMaster’s JVM if the job is small enough. This way, you avoid the overhead of requesting containers from the ResourceManager and asking the NodeManagers to start (supposedly small) tasks.
- Simplified user-log management and access. Logs generated by applications are not left on individual slave nodes (as with MRv1) but are moved to central storage, such as HDFS. Later, they can be used for debugging purposes or for historical analyses to discover performance issues.