Chapter 8: “Scaling SQL and Exploring Pipelines”

As your dataset grows from thousands of rows toward billions, traditional SQL query approaches begin to slow down, and even a single poorly placed index can make a query very expensive to run. To combat this, you need tools that handle massive amounts of data, run queries faster using parallel processing, and automate workflows for real-time and batch updates. In this chapter we cover, at a high level, how modern data systems scale using parallel computing, ETL pipelines, and cloud-native architecture.

Parallel Processing

Most modern databases (like PostgreSQL, Snowflake, BigQuery, Redshift) can execute queries using parallel processing. That means splitting your query into smaller pieces and running them simultaneously.

Benefits of Parallelism:

  • Faster aggregations on large datasets
  • Lower latency for dashboard queries
  • Improved concurrency

SQL operations that commonly run in parallel (see the sketch after this list):

  • GROUP BY and JOIN operations
  • Aggregate functions like SUM(), AVG()
  • Sort and filter stages
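
To see parallelism in action, here is a minimal PostgreSQL sketch assuming a hypothetical sales(region, amount) table; other engines parallelize similar plans automatically.

    -- Allow the planner to use up to 4 parallel workers per Gather node.
    SET max_parallel_workers_per_gather = 4;

    -- Aggregate over a large table; the plan should show Gather and
    -- Partial HashAggregate nodes when PostgreSQL parallelizes the scan.
    EXPLAIN (ANALYZE)
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region;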

Distributed Computing Architectures

Sometimes a single machine isn't enough. Systems like:

  • Google BigQuery (serverless MPP)
  • Amazon Redshift (clustered nodes)
  • Apache Spark (in-memory distributed compute)

...use Massively Parallel Processing (MPP) and distributed file systems to scale horizontally.

There are several ways to scale horizontally, such as:

  • Sharding: splits data across multiple nodes
  • Replication: copies data for fault tolerance
  • Partitioning: divides data by a specific range or hash

The most common way to scale data horizontally is partitioning. There are three main types: round-robin, hash, and range-based partitioning, each with its own advantages and disadvantages. The main pitfall to watch for is skew, meaning uneven data distribution across partitions; a PostgreSQL sketch of range and hash partitioning follows the list below.

  • Range based: partitions rows by ranges of values in your data; susceptible to skew if some ranges are much more popular than others
  • Hash based: partitions rows using a defined hash function; somewhat susceptible to skew if certain key values dominate
  • Round robin: distributes rows evenly across partitions; not susceptible to skew, since each partition receives the same share of rows
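
Below is a minimal PostgreSQL sketch of range and hash partitioning; the table and column names are hypothetical.

    -- Range partitioning: rows are routed to partitions by date range.
    CREATE TABLE events (
        event_id   BIGINT,
        user_id    BIGINT,
        created_at DATE
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2024 PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

    -- Hash partitioning: rows are spread by a hash of user_id, which
    -- reduces (but does not eliminate) skew from hot key values.
    CREATE TABLE sessions (
        session_id BIGINT,
        user_id    BIGINT
    ) PARTITION BY HASH (user_id);

    CREATE TABLE sessions_p0 PARTITION OF sessions
        FOR VALUES WITH (MODULUS 4, REMAINDER 0);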

ETL vs ELT Pipelines

Data pipelines move data from source systems to target systems (e.g., databases, dashboards).

ETL: Extract → Transform → Load

  • Common in traditional warehouses
  • Clean and transform before loading
  • Example tools: Talend, Informatica

ELT: Extract → Load → Transform

  • Common in cloud warehouses (e.g., BigQuery, Snowflake)
  • Load raw data, transform with SQL
  • Example tools: dbt, SQLMesh, Dataform

Sample Workflow: CSV Files → Cloud Storage → BigQuery → dbt → Dashboard
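
The transform step in that workflow is just SQL. Here is a minimal dbt-style model sketch, assuming a hypothetical raw orders table has already been loaded into BigQuery and declared as a dbt source.

    -- models/daily_revenue.sql (hypothetical model name)
    -- Raw data is loaded first; the transformation runs inside the warehouse.
    SELECT
        DATE(created_at) AS order_date,
        COUNT(*)         AS orders,
        SUM(order_total) AS revenue
    FROM {{ source('raw', 'orders') }}
    GROUP BY 1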

Streaming vs Batch Pipelines

Batch: Load entire datasets periodically (hourly, daily)

Streaming: Ingest small updates in near real-time (Kafka, Flink)

Choose batch when:

  • You can tolerate delay
  • Data is static between intervals

Choose streaming when:

  • You need low-latency alerts
  • Data changes rapidly (e.g., sensors, apps)
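
To make the batch side concrete, here is a minimal sketch of an hourly incremental load using a standard SQL MERGE; the staging and target table names are hypothetical.

    -- Merge the last hour of staged rows into the target table.
    MERGE INTO analytics.orders AS tgt
    USING staging.orders_last_hour AS src
        ON tgt.order_id = src.order_id
    WHEN MATCHED THEN
        UPDATE SET status = src.status, updated_at = src.updated_at
    WHEN NOT MATCHED THEN
        INSERT (order_id, status, updated_at)
        VALUES (src.order_id, src.status, src.updated_at);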

Some real-world examples of cloud-native data warehouses are:

  • BigQuery: serverless, ANSI SQL, parallel query execution, fast analytics at scale
  • Snowflake: auto-scaling, time travel, ELT workflows, collaborative analytics
  • Redshift: cluster-based MPP, deep AWS integration

These platforms support:

  • Seamless scaling
  • Data partitioning & clustering
  • Integration with dbt and BI tools
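
As one illustration of partitioning and clustering, here is a minimal BigQuery DDL sketch; the dataset, table, and column names are hypothetical.

    -- A table partitioned by date and clustered by customer_id, so queries
    -- that filter on view_date and customer_id scan far less data.
    CREATE TABLE analytics.page_views (
        view_id     INT64,
        customer_id INT64,
        url         STRING,
        view_date   DATE
    )
    PARTITION BY view_date
    CLUSTER BY customer_id;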

SQL in the Data Pipeline Ecosystem

SQL plays a central role in ELT stacks:

  • Transformation: dbt models written in SQL
  • Validation: assertions, tests, constraints
  • Materialization: views, tables, incremental loads
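
As a final sketch, here is what an incremental materialization can look like in a dbt model; the config macros are real dbt features, while the model and column names are hypothetical.

    -- models/events_incremental.sql (hypothetical model name)
    {{ config(materialized='incremental', unique_key='event_id') }}

    SELECT
        event_id,
        user_id,
        event_type,
        created_at
    FROM {{ source('raw', 'events') }}
    {% if is_incremental() %}
      -- On incremental runs, only process rows newer than what is already loaded.
      WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
    {% endif %}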