Azure Blob Storage with PySpark

Azure Blob Storage is a service for storing large amounts of unstructured object data, such as text or binary data. You can use Blob Storage to expose data publicly to the world or to store application data privately.

In this post, I'll explain how to access Azure Blob Storage from the Spark framework using Python (PySpark).

So, let's assume you already have an Azure storage account with some data in it.

Accessing data in Azure Blob Storage requires additional libraries, because it is exposed through the wasb/wasbs protocol rather than plain HTTP. The prebuilt JAR files, named hadoop-azure.jar and azure-storage.jar, need to be added to spark-submit (for example via its --jars flag) when you submit a job.

At the application level, the first step, as always in Spark applications, is to grab a Spark session:

from pyspark.sql import SparkSession
session = SparkSession.builder.getOrCreate()
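
If you'd rather not pass the JARs on the spark-submit command line, an alternative is to let Spark resolve them from Maven when the session is created, via the spark.jars.packages option. A minimal sketch, assuming a hadoop-azure version that matches your Hadoop distribution (azure-storage comes along as a transitive dependency):

from pyspark.sql import SparkSession

# Sketch: have Spark fetch the Azure connector from Maven at startup.
# The version below is an assumption; align it with the Hadoop version
# bundled with your Spark build.
session = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.3.4")
    .getOrCreate()
)

Note that spark.jars.packages only takes effect when the session is first created, so it has to be set on the builder rather than on an existing session.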

Then you need to set up an account key:

session.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>"
)

Alternatively, you can set a SAS token for a container:

session.conf.set(
    "fs.azure.sas.<container-name>.blob.core.windows.net",
    "<sas-token>"
)

Once the account access key or SAS token is set, you're ready to read from and write to Azure Blob Storage:

sdf = session.read.parquet(
    "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<prefix>"
)
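
Writing works through the same wasbs:// URL scheme. A quick sketch, with the same placeholders as the read example above and a hypothetical <output-prefix>:

# Sketch: write the DataFrame back to the container under an assumed
# output prefix; "overwrite" replaces any existing data at that path.
sdf.write.mode("overwrite").parquet(
    "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<output-prefix>"
)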