Azure Blob Storage with PySpark

Azure Blob Storage is a service for storing large amounts of unstructured data in any text or binary format. It is a good foundation for building a data warehouse or data lake around, storing raw or preprocessed data for future analytics.

In this post, I'll explain how to access Azure Blob Storage from Spark applications written in Python (PySpark).

Azure Blob Storage requires additional libraries, because it is accessed through the custom wasb/wasbs protocols. Windows Azure Storage Blob (wasb) is an extension built on top of the HDFS APIs that separates storage from compute; wasbs is its TLS-secured variant. To access resources in Azure Blob Storage, add the jar files hadoop-azure.jar and azure-storage.jar to the spark-submit command when submitting a job.

$ spark-submit --py-files src.zip \
			--master yarn \
			--deploy-mode=cluster \
			--jars hadoop-azure.jar,azure-storage.jar \
			src/app.py
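
If you prefer not to manage the jar files by hand, Spark can also resolve them from Maven at startup through the spark.jars.packages setting when the SparkSession (introduced below) is built. A minimal sketch, assuming the same artifact versions as in this post (update the coordinates as needed):

from pyspark.sql import SparkSession

# Fetch hadoop-azure and azure-storage from Maven Central at startup
# instead of passing --jars; the coordinates mirror the versions used here.
session = (
	SparkSession.builder
	.config(
		"spark.jars.packages",
		"org.apache.hadoop:hadoop-azure:2.7.3,com.microsoft.azure:azure-storage:2.2.0",
	)
	.getOrCreate()
)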

Also, if you are using Docker or installing the application directly on a cluster, there is a tip for you as well: you do not need to pass additional jars if they are already in the right place, where PySpark can find them (this, of course, must be done on every node of the cluster). Use the commands below with caution, since versions and links may change!

$ wget -nc -nv -O /usr/local/lib/python3.5/site-packages/pyspark/jars/azure-storage-2.2.0.jar https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage/2.2.0/azure-storage-2.2.0.jar
$ wget -nc -nv -O /usr/local/lib/python3.5/site-packages/pyspark/jars/hadoop-azure-2.7.3.jar https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.3/hadoop-azure-2.7.3.jar
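
The site-packages path above depends on your Python version and installation. A small sketch like this prints the directory where your installed PySpark actually looks for bundled jars:

import os

import pyspark

# Locate the jars directory of the installed PySpark distribution;
# the downloaded hadoop-azure and azure-storage jars belong here.
print(os.path.join(os.path.dirname(pyspark.__file__), "jars"))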

At the application level, the first step, as always in Spark applications, is to obtain a SparkSession. The SparkSession is the entry point to the cluster's resources: it is used for reading data, executing SQL queries over that data, and getting the results back.

from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()

Then set the account access key for your storage account:

session.conf.set(
	"fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
	"<your-storage-account-access-key>"
)

or a SAS token scoped to a container:

session.conf.set(
	"fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net",
	"<sas-token>"
)
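
The same credentials can also be supplied when the session is built. A minimal sketch, assuming the placeholder names above; the spark.hadoop. prefix forwards the setting into the Hadoop configuration that the wasb/wasbs filesystem reads:

from pyspark.sql import SparkSession

# Bake the storage key into the session at build time; the spark.hadoop.
# prefix propagates it to the underlying Hadoop configuration.
session = (
	SparkSession.builder
	.config(
		"spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
		"<your-storage-account-access-key>",
	)
	.getOrCreate()
)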

Once an account access key or a SAS token is set up, you're ready to read from and write to Azure Blob Storage:

sdf = session.read.parquet(
	"wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<prefix>"
)
sdf.show()
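
Writing works the same way. A minimal sketch, assuming the DataFrame from above and a placeholder output prefix:

# Write the DataFrame back to the same container as Parquet;
# mode("overwrite") replaces any existing data under the prefix.
sdf.write.mode("overwrite").parquet(
	"wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<output-prefix>"
)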
