Spark History Server and monitoring jobs performance

Imagine a situation where you wrote a Spark job to process a huge amount of data and it took two days to complete. It happens. In fact, it happens regularly.

To properly fine-tune these jobs, engineers need information. That information can be obtained from Spark event logs (if the job runs on a cluster, they can easily be exported through the Spark UI and passed on to the developers).

From the documentation:

Spark’s Standalone Mode cluster manager also has its own web UI. If an application has logged events over the course of its lifetime, then the Standalone master’s web UI will automatically re-render the application’s UI after the application has finished.

If Spark is run on Mesos or YARN, it is still possible to reconstruct the UI of a finished application through Spark’s history server, provided that the application’s event logs exist.

So what is the Spark history server?

The Spark history server lets us view a Spark application's performance metrics, partitions, and execution plans while it is running or after it has completed. A Spark installation ships with built-in scripts for it, sbin/start-history-server.sh and sbin/stop-history-server.sh, but the history server is not enabled by default; we must configure it ourselves. Even though it's called a server, it is often easiest to run it locally, especially if the Spark event logs can only be obtained from the DevOps team.

The following configuration settings must be set:

  • spark.eventLog.enabled: Must be set to true; this property makes Spark log the events needed to reconstruct the web UI after the application has completed.
  • spark.eventLog.dir: The directory where application event logs are written. This can be an HDFS path starting with hdfs:// or a local path starting with file://; the default value is file:///tmp/spark-events. The directory must be created in advance.
  • spark.history.fs.logDirectory: The directory the history server daemon reads event logs from. It is set on the server side, and its value should match spark.eventLog.dir. It can likewise be an HDFS path starting with hdfs:// or a local path starting with file://; the default value is file:///tmp/spark-events, and the directory must be created in advance.
  • (Optional) spark.eventLog.compress: Defines whether or not to compress events in the Spark event log, using Spark's configured compression codec.
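These settings do not have to live in a config file; they can also be passed per application on the command line. A minimal sketch, assuming the default log location and a hypothetical job script my_job.py:

```shell
# Enable event logging for a single application via spark-submit.
# The log directory (here the default /tmp/spark-events) must already exist.
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=file:///tmp/spark-events \
  my_job.py
```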

All history server configurations can be set in $SPARK_HOME/conf/spark-defaults.conf (if the file doesn't exist, copy $SPARK_HOME/conf/spark-defaults.conf.template and remove the .template suffix) as described below.

For example:

spark.eventLog.enabled  true
spark.eventLog.dir  file:///usr/lib/spark/logs
spark.history.fs.logDirectory   file:///usr/lib/spark/logs
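Remember that the log directory must exist before anything is logged; Spark will not create it for you. For example, for the default location (the /usr/lib/spark/logs path used above would need the same treatment, possibly with sudo):

```shell
# Create the event log directory ahead of time.
mkdir -p /tmp/spark-events
```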

Then you are ready to start the Spark History service:

$SPARK_HOME/sbin/start-history-server.sh

To stop the Spark History service, run:

$SPARK_HOME/sbin/stop-history-server.sh

Now the history server is available at http://localhost:18080 (use the URL from your terminal output).
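Besides the web UI, the history server exposes a JSON REST API (described on Spark's Monitoring page), which is handy for scripting. For example, assuming the server is running on the default port, listing the applications it knows about might look like:

```shell
# Query the history server's REST API for recorded applications.
curl http://localhost:18080/api/v1/applications
```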

