Azure Blob Storage is Microsoft's highly scalable cloud storage solution designed for handling massive amounts of unstructured data. It's cost-effective and versatile, capable of storing all kinds of data—from text and binary to images and videos. This makes it a great choice for creating data lakes or warehouses, where you can store both raw and preprocessed data for future analysis.
In this post, we'll explore how to access Azure Blob Storage using PySpark, the Python API for Apache Spark. Apache Spark is a powerful, fast, and general-purpose cluster computing system that enables large-scale data processing. With its ability to manage data parallelism and fault tolerance across clusters, Spark is a critical tool in the data engineer's arsenal. Yeah, I've been talking about Spark for a while now.
What Makes Azure Blob Storage Unique?
Before diving into the nuts and bolts of accessing Azure Blob Storage with PySpark, let's highlight what sets it apart from Spark's point of view. Access goes through the wasb/wasbs URI schemes. The wasb scheme, short for Windows Azure Storage Blob, is implemented by the hadoop-azure connector on top of the Hadoop FileSystem APIs, offering an abstraction that separates storage concerns from processing. To interact with Azure Blob Storage via Spark, you must include the hadoop-azure.jar and azure-storage.jar files in your spark-submit command.
Here's how you can submit a Spark job with the necessary dependencies:
$ spark-submit --py-files src.zip \
    --master yarn \
    --deploy-mode cluster \
    --jars hadoop-azure.jar,azure-storage.jar \
    src/app.py
Alternatively, if you're using Docker or have your application installed directly on a cluster, place the JAR files where PySpark can locate them. The following commands download the required JARs straight into PySpark's jars directory:
$ wget -nc -nv -O /usr/local/lib/python3.5/site-packages/pyspark/jars/azure-storage-2.2.0.jar https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage/2.2.0/azure-storage-2.2.0.jar
$ wget -nc -nv -O /usr/local/lib/python3.5/site-packages/pyspark/jars/hadoop-azure-2.7.3.jar https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.3/hadoop-azure-2.7.3.jar
Setting Up a Spark Session
Now that we have the necessary libraries, let's create a Spark Session, which serves as the entry point for PySpark to interact with cluster resources:
from pyspark.sql import SparkSession
session = SparkSession.builder.getOrCreate()
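If you downloaded the connector JARs somewhere other than PySpark's own jars directory, you can also hand them to the builder explicitly via the spark.jars configuration. A minimal sketch, assuming the files live under /opt/jars (a path chosen here purely for illustration):

from pyspark.sql import SparkSession

# spark.jars must be set before the session is created, so pass it to the builder.
# The /opt/jars paths are illustrative; adjust them to wherever you saved the files.
session = (
    SparkSession.builder
    .config(
        "spark.jars",
        "/opt/jars/hadoop-azure-2.7.3.jar,/opt/jars/azure-storage-2.2.0.jar",
    )
    .getOrCreate()
)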
Accessing Data from Azure Blob Storage
To interact with Azure Blob Storage, you'll need to configure your Spark session with either an account access key or a Shared Access Signature (SAS) token.
Using an account key:
session.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>"
)
Or using a SAS token:
session.conf.set(
    "fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net",
    "<sas-token>"
)
After setting up the Spark session and account key or SAS token, we can start reading and writing data from Azure Blob Storage using PySpark.
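A quick practical note: rather than hardcoding the secret in your job, you will usually want to pull it from the environment or a secret store. A minimal sketch, assuming the key is exposed through an environment variable named AZURE_STORAGE_KEY (a name chosen here for illustration, not anything Spark looks for on its own):

import os

# Read the account key from the environment instead of embedding it in the code.
# AZURE_STORAGE_KEY is an illustrative variable name, not a Spark convention.
account_key = os.environ["AZURE_STORAGE_KEY"]

session.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    account_key,
)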
Reading Data from Azure Blob Storage
Once the Spark session is set up, you can start reading data from Azure Blob Storage. The read attribute of the Spark session returns a DataFrameReader, whose format-specific methods (such as parquet) return a DataFrame. The path to your data should start with wasbs:// (for secure, TLS-encrypted access) or wasb:// (for non-secure access).
Here's an example of reading a Parquet file from Azure Blob Storage:
sdf = session.read.parquet(
    "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<prefix>"
)
In this path:
- <container-name> is the name of your container within Azure Blob Storage.
- <storage-account-name> is the name of your storage account.
- <prefix> is the optional path to the file or folder. If your file is at the root, you can omit the prefix.
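For example, with a hypothetical storage account named analyticsacct and a container named raw, reading all Parquet files under events/2023/ would look like this (every name here is a placeholder for illustration):

# All names below are illustrative placeholders, not real resources.
sdf = session.read.parquet(
    "wasbs://raw@analyticsacct.blob.core.windows.net/events/2023/"
)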
Writing Data to Azure Blob Storage
You can also write data back to Azure Blob Storage using PySpark. The write attribute of a DataFrame returns a DataFrameWriter, which lets you choose the output format and path:
sdf.write.csv(
    "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<prefix>"
)
You can further customize the write operation by specifying options for format, compression, partitioning, and more.
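Here's a sketch of what those options look like in practice; the date partition column is hypothetical and only there to show the shape of the call:

# Overwrite any existing output, partition by a (hypothetical) date column,
# and write gzip-compressed CSV files with a header row.
(
    sdf.write
    .mode("overwrite")
    .partitionBy("date")
    .option("header", True)
    .option("compression", "gzip")
    .csv("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<prefix>")
)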
Beyond Reading and Writing
Beyond basic reading and writing, PySpark offers a wealth of tools for working with data stored in Azure Blob Storage. You can use PySpark SQL for executing SQL queries or leverage PySpark MLlib for performing machine learning tasks.
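As a small taste of the SQL side, here's a sketch of querying the DataFrame we read earlier; the view name and column are made up for illustration:

# Register the DataFrame as a temporary view so it can be queried with SQL.
sdf.createOrReplaceTempView("events")

# The event_type column is hypothetical; substitute a column from your own data.
counts = session.sql(
    "SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type"
)
counts.show()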
Once the data is reachable through wasb/wasbs paths, the rest of the Spark ecosystem works on it just as it would on any other source or sink.
Conclusion
Combining Azure Blob Storage with PySpark offers a powerful platform for building scalable data pipelines and analytics solutions in the cloud. Whether you're processing large datasets, running complex queries, or training machine learning models, this combination provides the flexibility and performance needed for modern data engineering tasks.