Chapter 8: “Scaling SQL and Exploring Pipelines”

As your dataset grows from thousands of rows toward billions, traditional SQL query approaches begin to slow down, and even a single poorly placed index can make a query very expensive to run. To combat this, you need tools that handle massive amounts of data, run queries faster using parallel processing, and automate workflows for real-time and batch updates. In this chapter we cover, at a high level, how modern data systems scale using parallel computing, ETL pipelines, and cloud-native architecture.

Parallel Processing

Most modern databases (like PostgreSQL, Snowflake, BigQuery, Redshift) can execute queries using parallel processing. That means splitting your query into smaller pieces and running them simultaneously.

Benefits of Parallelism:

  • Faster aggregations on large datasets
  • Lower latency for dashboard queries
  • Improved concurrency

SQL operations that commonly run in parallel (see the sketch after this list):

  • GROUP BY and JOIN operations
  • Aggregate functions like SUM(), AVG()
  • Sort and filter stages
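
To see parallelism in action, here is a minimal PostgreSQL sketch assuming a hypothetical sales(region, amount) table; other engines parallelize similar plans automatically.

    -- Allow the planner to use up to 4 parallel workers per Gather node.
    SET max_parallel_workers_per_gather = 4;

    -- Aggregate over a large table; the plan should show Gather and
    -- Partial HashAggregate nodes when PostgreSQL parallelizes the scan.
    EXPLAIN (ANALYZE)
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region;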

Distributed Computing Architectures

Sometimes a single machine isn't enough. Systems like:

  • Google BigQuery (serverless MPP)
  • Amazon Redshift (clustered nodes)
  • Apache Spark (in-memory distributed compute)

...use Massively Parallel Processing (MPP) and distributed file systems to scale horizontally.

There are several ways to scale horizontally, such as:

  • Sharding: splits data across multiple nodes
  • Replication: copies data for fault tolerance
  • Partitioning: divides data by a specific range or hash

The most common way to scale data horizontally is partitioning. There are three main types: round-robin, hash, and range-based partitioning, each with its own advantages and disadvantages. The main pitfall to watch for is skew, meaning uneven data distribution across partitions; a PostgreSQL sketch of range and hash partitioning follows the list below.

  • Range based: partitions rows by ranges of values in your data; susceptible to skew if some ranges are much more popular than others
  • Hash based: partitions rows using a defined hash function; somewhat susceptible to skew if certain key values dominate
  • Round robin: distributes rows evenly across partitions; not susceptible to skew, since each partition receives the same share of rows
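
Below is a minimal PostgreSQL sketch of range and hash partitioning; the table and column names are hypothetical.

    -- Range partitioning: rows are routed to partitions by date range.
    CREATE TABLE events (
        event_id   BIGINT,
        user_id    BIGINT,
        created_at DATE
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2024 PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

    -- Hash partitioning: rows are spread by a hash of user_id, which
    -- reduces (but does not eliminate) skew from hot key values.
    CREATE TABLE sessions (
        session_id BIGINT,
        user_id    BIGINT
    ) PARTITION BY HASH (user_id);

    CREATE TABLE sessions_p0 PARTITION OF sessions
        FOR VALUES WITH (MODULUS 4, REMAINDER 0);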

ETL vs ELT Pipelines

Data pipelines move data from source systems to target systems (e.g., databases, dashboards).

ETL: Extract → Transform → Load

  • Common in traditional warehouses
  • Clean and transform before loading
  • Example tools: Talend, Informatica

ELT: Extract → Load → Transform

  • Common in cloud warehouses (e.g., BigQuery, Snowflake)
  • Load raw data, transform with SQL
  • Example tools: dbt, SQLMesh, Dataform

Sample Workflow: CSV Files → Cloud Storage → BigQuery → dbt → Dashboard
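
The transform step in that workflow is just SQL. Here is a minimal dbt-style model sketch, assuming a hypothetical raw orders table has already been loaded into BigQuery and declared as a dbt source.

    -- models/daily_revenue.sql (hypothetical model name)
    -- Raw data is loaded first; the transformation runs inside the warehouse.
    SELECT
        DATE(created_at) AS order_date,
        COUNT(*)         AS orders,
        SUM(order_total) AS revenue
    FROM {{ source('raw', 'orders') }}
    GROUP BY 1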

Streaming vs Batch Pipelines

Batch: Load entire datasets periodically (hourly, daily)

Streaming: Ingest small updates in near real-time (Kafka, Flink)

Choose batch when:

  • You can tolerate delay
  • Data is static between intervals

Choose streaming when:

  • You need low-latency alerts
  • Data changes rapidly (e.g., sensors, apps)
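
To make the batch side concrete, here is a minimal sketch of an hourly incremental load using a standard SQL MERGE; the staging and target table names are hypothetical.

    -- Merge the last hour of staged rows into the target table.
    MERGE INTO analytics.orders AS tgt
    USING staging.orders_last_hour AS src
        ON tgt.order_id = src.order_id
    WHEN MATCHED THEN
        UPDATE SET status = src.status, updated_at = src.updated_at
    WHEN NOT MATCHED THEN
        INSERT (order_id, status, updated_at)
        VALUES (src.order_id, src.status, src.updated_at);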

Some real-world examples of cloud-native data warehouses are:

  • BigQuery: serverless, ANSI SQL, parallel query execution, fast analytics at scale
  • Snowflake: auto-scaling, time travel, ELT workflows, collaborative analytics
  • Redshift: cluster-based MPP, deep AWS integration

These platforms support:

  • Seamless scaling
  • Data partitioning & clustering
  • Integration with dbt and BI tools
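
As one illustration of partitioning and clustering, here is a minimal BigQuery DDL sketch; the dataset, table, and column names are hypothetical.

    -- A table partitioned by date and clustered by customer_id, so queries
    -- that filter on view_date and customer_id scan far less data.
    CREATE TABLE analytics.page_views (
        view_id     INT64,
        customer_id INT64,
        url         STRING,
        view_date   DATE
    )
    PARTITION BY view_date
    CLUSTER BY customer_id;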

SQL in the Data Pipeline Ecosystem

SQL plays a central role in ELT stacks:

  • Transformation: dbt models written in SQL
  • Validation: assertions, tests, constraints
  • Materialization: views, tables, incremental loads
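
As a final sketch, here is what an incremental materialization can look like in a dbt model; the config macros are real dbt features, while the model and column names are hypothetical.

    -- models/events_incremental.sql (hypothetical model name)
    {{ config(materialized='incremental', unique_key='event_id') }}

    SELECT
        event_id,
        user_id,
        event_type,
        created_at
    FROM {{ source('raw', 'events') }}
    {% if is_incremental() %}
      -- On incremental runs, only process rows newer than what is already loaded.
      WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
    {% endif %}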