
How to Drop a BigQuery Sharded Table

Sharded tables in BigQuery are awesome. They make querying huge datasets faster, cheaper, and way easier to manage. But here's the not-so-fun part: when you want to delete one, it's not as simple as bq rm. Ugh.

So, if you've ever stared at a table with hundreds of shards and thought, "There's gotta be a faster way," you're in the right place. This guide will walk you through dropping sharded tables in BigQuery using gcloud, bq, and some quick scripting hacks.

Sharding in BigQuery

BigQuery is Google's fully managed, serverless data warehouse that lets you run SQL queries on huge datasets in seconds. Built on distributed columnar storage and Dremel (Google's query engine), BigQuery is optimized for high-performance analytics at scale without requiring you to manage hardware, indexing, or maintenance.

BigQuery achieves its speed and scalability through a combination of distributed storage, columnar formats, and massively parallel processing (MPP). However, when dealing with petabyte-scale datasets, blindly scanning entire tables for every query would be inefficient and expensive. To optimize performance, BigQuery relies on sharding, a technique that logically segments data into smaller, more manageable pieces, reducing the amount of data scanned per query.

Sharding is BigQuery's way of splitting a large table into smaller, more manageable chunks. These chunks, called shards (or horizontal partitions if you like), are typically organized by time (e.g., daily or monthly). Instead of searching through the entire table, BigQuery can scan only the relevant shards, making queries faster and cheaper. Here's why it's a critical part of working with large datasets:

  • Performance Optimization: Since BigQuery stores each shard as a separate table, queries that target specific shards — either by naming a shard directly or by filtering on _TABLE_SUFFIX in a wildcard query — scan only those shards instead of the entire table family. This significantly reduces query execution time and resource consumption.
  • Cost Efficiency: BigQuery's pricing is based on the amount of data scanned. If a table spans multiple years but a query only needs data from the past week, sharding ensures that only the relevant subset is read, lowering query costs.
  • Data Management & Organization: Sharding provides a structured approach to handling large datasets, particularly for time-series data. Instead of a single, ever-growing table, data is logically divided into shards (e.g., daily or monthly), making it easier to manage retention policies, optimize queries, and improve overall maintainability.
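To make the cost point concrete, here's a quick sketch of a wildcard query (project, dataset, and table prefix are all hypothetical) that scans only one week of shards via the `_TABLE_SUFFIX` pseudo-column:

```shell
# Hypothetical names throughout; the trailing * matches every shard of the
# events_ family, and _TABLE_SUFFIX exposes each shard's YYYYMMDD suffix,
# so only shards inside the BETWEEN range are scanned (and billed).
SQL='
SELECT COUNT(*) AS events
FROM `my-project.analytics.events_*`
WHERE _TABLE_SUFFIX BETWEEN "20240101" AND "20240107"
'

# Only attempt the query if the bq CLI is actually installed
if command -v bq >/dev/null 2>&1; then
    bq query --use_legacy_sql=false "$SQL"
fi
```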

But then there's the downside: deleting sharded tables is unnecessarily tricky. Unlike a plain old table, where you just run bq rm, sharded tables are like a collection of mini-tables. You have to delete each one manually — or script your way out of it.
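To see why that gets tedious, here's what dropping just one shard looks like (all names hypothetical):

```shell
# Hypothetical names: one shard of an events_ table family
PROJECT="my-project"
DATASET="analytics"
SHARD="events_20240101"

# bq addresses tables as project:dataset.table
TABLE_REF="${PROJECT}:${DATASET}.${SHARD}"
echo "Dropping ${TABLE_REF}"   # → Dropping my-project:analytics.events_20240101

# -f skips the confirmation prompt, -t insists the target is a table
if command -v bq >/dev/null 2>&1; then
    bq rm -f -t "${TABLE_REF}"
fi
```

Multiply that by hundreds of shards and scripting becomes the only sane option.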

Step-by-Step Guide

Before diving in, ensure you have gcloud installed and configured. You can download and install the Google Cloud SDK from the official documentation.

Step 1: Check and Set Your GCP Project

First, make sure you're working in the correct Google Cloud Project. Open your terminal and run:

gcloud config list

This command lists your current configuration and project. If you're not in the correct project, set the right one with:

gcloud config set project <your_project_id>

Replace <your_project_id> with the actual ID of your project.
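Before moving on, it's worth previewing which shards actually exist, so you know exactly what the cleanup will touch. A quick sketch (project, dataset, and prefix are hypothetical):

```shell
# Hypothetical names; adjust to your own environment
PROJECT_ID="my-project"
DATASET_ID="analytics"
PREFIX="events_"

# bq ls caps output at 50 rows by default, so -n raises the limit;
# grep narrows the listing down to the shard family we care about
if command -v bq >/dev/null 2>&1; then
    bq ls -n 10000 "${PROJECT_ID}:${DATASET_ID}" | grep "${PREFIX}" || true
fi
```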

Step 2: Automating Shard Deletion by Date Range

This bash script is designed to delete sharded tables in BigQuery for a specific date range.

#!/bin/bash

PROJECT_ID="<your project id>"
DATASET_ID="<your dataset>"
TARGET_TABLE="<your target table prefix, example: target_table_daily_>"
START_DATE="2025-01-01"
END_DATE="2025-02-01"
DRY_RUN=${DRY_RUN:-true}  # Set to 'false' to actually run

increment_date() {
    # -j/-v/-f is BSD/macOS date syntax; fall back to GNU date on Linux
    date -j -v +1d -f "%Y-%m-%d" "$1" +"%Y-%m-%d" 2>/dev/null \
        || date -d "$1 + 1 day" +"%Y-%m-%d"
}

format_date() {
    date -j -f "%Y-%m-%d" "$1" +"%Y%m%d" 2>/dev/null \
        || date -d "$1" +"%Y%m%d"
}

current_date="$START_DATE"

# Lexicographic < works for ISO YYYY-MM-DD dates; note END_DATE itself is excluded
while [[ "$current_date" < "$END_DATE" ]]; do
    # Format the date as required by the target tables
    formatted_date=$(format_date "$current_date")

    # Construct the target table name
    target_table_name="${TARGET_TABLE}${formatted_date}"

    echo "Checking if table exists: ${target_table_name}"
    if bq show --format=none "${PROJECT_ID}:${DATASET_ID}.${target_table_name}" 2>/dev/null; then
        echo "Table exists."
        
        if [[ "$DRY_RUN" == "true" ]]; then
            echo "[DRY RUN] Would delete table: ${PROJECT_ID}:${DATASET_ID}.${target_table_name}"
        else
            echo "Deleting table: ${target_table_name}"
            bq rm -f -t "${PROJECT_ID}:${DATASET_ID}.${target_table_name}"
        fi
    else
        echo "Table does not exist."
    fi

    # Increment the date by one day
    current_date=$(increment_date "$current_date")
done

# ${DRY_RUN:+...} would always expand since DRY_RUN is always set, so test explicitly
if [[ "$DRY_RUN" == "true" ]]; then
    echo "Process completed! (Dry Run Mode Enabled)"
else
    echo "Process completed!"
fi

This script automates the deletion of sharded tables in BigQuery by iterating over a specified date range and checking if each shard exists before removing it. The key idea is that BigQuery stores shards as separate tables with a naming convention like table_prefix_YYYYMMDD, so we dynamically generate these names and process them one by one.
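That name construction is easy to sketch in isolation (the prefix and date are just examples):

```shell
PREFIX="target_table_daily_"
DAY="2025-01-15"

# Drop the dashes to get the YYYYMMDD suffix used by sharded table names
SUFFIX=$(echo "$DAY" | tr -d '-')
echo "${PREFIX}${SUFFIX}"   # → target_table_daily_20250115
```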

It starts by defining essential variables: the Google Cloud project, dataset, and table prefix, along with the start and end dates for the deletion process. A dry-run mode is enabled by default to prevent accidental deletions, requiring explicit confirmation before actually removing any tables. Remember, always double-check the shards you're about to delete to avoid any accidental data loss.
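Assuming you save the script as drop_shards.sh (the filename is arbitrary), a safe workflow is dry run first, real run second:

```shell
# Guarded so this is a no-op if the script isn't in the current directory
if [ -f drop_shards.sh ]; then
    chmod +x drop_shards.sh
    DRY_RUN=true ./drop_shards.sh    # prints what would be deleted, touches nothing
    DRY_RUN=false ./drop_shards.sh   # actually issues bq rm for each shard
fi
```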

Feel free to drop any comments or questions below. If there's a better way you've found to handle this, I'm all ears — no one likes extra work, especially when it comes to managing data at scale.

