
Spark History Server and monitoring jobs performance

Imagine that you wrote a Spark job to process a huge amount of data, and it took two days to finish. It happens. Actually, it happens regularly.

To tune such jobs, engineers need information. It can be obtained from Spark events (if you run something on a cluster, you can easily export them from the Spark UI and hand them to the developers).
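Event logging can also be turned on for a single application at submit time, without touching any config files. A minimal sketch, assuming the target directory already exists; the JAR and class names are hypothetical placeholders:

# Enable event logging for one spark-submit run (class and JAR are placeholders)
$SPARK_HOME/bin/spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=file:///tmp/spark-events \
  --class com.example.MyJob \
  my-job.jar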

From docs:

Spark’s Standalone Mode cluster manager also has its own web UI. If an application has logged events over the course of its lifetime, then the Standalone master’s web UI will automatically re-render the application’s UI after the application has finished.

If Spark is run on Mesos or YARN, it is still possible to reconstruct the UI of a finished application through Spark’s history server, provided that the application’s event logs exist.

So what is the Spark History Server?

The Spark History Server lets us review Spark application performance metrics and execution plans after the application has completed. The default Spark installation ships with built-in scripts, sbin/start-history-server.sh and sbin/stop-history-server.sh, but Spark is not configured for the History Server out of the box. We have to configure it ourselves.

The following configuration settings should be set:

  • spark.eventLog.enabled: Must be set to true; this property is used to reconstruct the web UI after the application has completed.
  • spark.eventLog.dir: The directory where an application's event log is saved. The value can be an HDFS path beginning with hdfs:// or a local path beginning with file://; the default is file:///tmp/spark-events. The directory must be created in advance (see the sketch after this list).
  • spark.history.fs.logDirectory: Read by the History Server daemon. It is set on the server side and should point to the same location as spark.eventLog.dir, again either an hdfs:// or a file:// path, with the same default of file:///tmp/spark-events.
  • (Optional) spark.eventLog.compress: Specifies whether to compress logged Spark events. Snappy is used as the default compression algorithm.
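Since the log directory is not created automatically, here is a quick sketch of creating it up front, for both the local default and a hypothetical HDFS location:

# Local filesystem (the default location)
mkdir -p /tmp/spark-events

# Or, if event logs live on HDFS (the path here is an example)
hdfs dfs -mkdir -p /user/spark/spark-events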

All History Server settings go in the $SPARK_HOME/conf/spark-defaults.conf file (if you don't have one, rename $SPARK_HOME/conf/spark-defaults.conf.template by removing the .template suffix), as described below.

For example:

spark.eventLog.enabled  true
spark.eventLog.dir  file:///usr/lib/spark/logs
spark.history.fs.logDirectory   file:///usr/lib/spark/logs
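To check that logging works before starting the server, you can run any application and look for a new file in the log directory. A minimal sketch using the SparkPi example bundled with Spark:

# Run the bundled SparkPi example; an event log file should appear afterwards
$SPARK_HOME/bin/run-example SparkPi 10
ls /usr/lib/spark/logs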

Then you are ready to start the Spark History Server:

$SPARK_HOME/sbin/start-history-server.sh
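By default the UI listens on port 18080. If that port is taken, the port can be changed through the daemon's environment; a sketch, assuming the standard spark.history.ui.port property (18081 here is an arbitrary choice):

# Start the History Server on a different port
SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18081" $SPARK_HOME/sbin/start-history-server.sh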

To stop the Spark History Server, run:

$SPARK_HOME/sbin/stop-history-server.sh

Now you have the History Server running at http://localhost:18080 (use the URL from your command output).
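Besides the web UI, the History Server exposes a REST API that is handy for scripting; a quick sketch listing completed applications, assuming the default port:

# List applications known to the History Server, returned as JSON
curl http://localhost:18080/api/v1/applications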
