Chapter 7: Data Modeling and Preparation

Databases aren't just containers for data; they're structured systems. A good data model ensures:

  • Consistency: no conflicting or redundant data
  • Efficiency: fewer joins, faster queries
  • Scalability: easier to expand or change
  • Integrity: fewer bugs and edge cases

Whether you're building a student tracking system, a biodiversity survey database, or a business dashboard, modeling your data upfront saves time downstream.

Types of Data Models:

  • Conceptual Model: high-level and business-focused; defines entities and relationships
    Example: Students enroll in Courses
  • Logical Model: focused on structure, independent of any particular DBMS; includes primary/foreign keys and attributes
    Example: Student(student_id, name)
  • Physical Model: implementation-ready; includes indexes, datatypes, and storage engines
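
For instance, a physical model of the Student entity might look like this minimal PostgreSQL-style sketch (the VARCHAR length and the index are illustrative choices, not requirements):

CREATE TABLE student (
  student_id SERIAL PRIMARY KEY,       -- datatype and key choices are physical-model decisions
  name VARCHAR(100) NOT NULL
);

-- Indexes are a physical-model concern: they speed up lookups on frequently searched columns
CREATE INDEX idx_student_name ON student (name);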

It’s important to understand what entities, attributes, and relationships are.

  • Entities = Things (e.g., Students, Courses, Surveys)
  • Attributes = Properties (e.g., name, date, score)
  • Relationships = Connections (e.g., Students take Courses)

Use Entity-Relationship Diagrams (ERDs) to visualize how tables relate. Two key elements to understand are:

  • Primary Key (PK): uniquely identifies a record
  • Foreign Key (FK): references a PK in another table
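
In SQL, a foreign key is declared with REFERENCES. Here is a hypothetical sketch, reusing the student table from the physical-model example above:

CREATE TABLE course (
  course_id SERIAL PRIMARY KEY,        -- PK: uniquely identifies each course
  title VARCHAR(200) NOT NULL
);

CREATE TABLE enrollment (
  student_id INTEGER NOT NULL REFERENCES student (student_id),  -- FK into student
  course_id  INTEGER NOT NULL REFERENCES course (course_id),    -- FK into course
  PRIMARY KEY (student_id, course_id)  -- composite PK: one row per student-course pair
);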

Normalization: Reducing Redundancy

Normalization organizes data to reduce duplication. Normalized tables are easier to maintain, improve data integrity, and reduce update anomalies.

Example: Unnormalized vs Normalized Design

Unnormalized: Student(name, major, course1, course2, course3)

Normalized:

  • Student(student_id, name, major_id)
  • Major(major_id, name)
  • Course(course_id, title)
  • Enrollment(student_id, course_id)
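
With this design, a student's courses come from a join rather than from repeated columns. A sketch of that query, using the table and column names above:

SELECT s.name, c.title
FROM student s
JOIN enrollment e ON e.student_id = s.student_id
JOIN course c ON c.course_id = e.course_id
WHERE s.student_id = 42;               -- 42 is a hypothetical student ID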

It is important to know when to denormalize. Normalization is great for transactional systems, but analytics may require denormalization for performance.

Denormalization examples:

  • Storing course_count in the student table
  • Storing location and timestamp together in one wide survey row

Denormalization makes sense when read speed matters more than write speed, when you control how often the data is refreshed, or when joins are slowing down critical queries.
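
As a sketch of the first example (assuming the student and enrollment tables from earlier), you could cache a derived count and refresh it on a schedule you control:

-- Add a derived column; it duplicates information already in enrollment
ALTER TABLE student ADD COLUMN course_count INTEGER DEFAULT 0;

-- Refresh the cached count on whatever schedule you control
UPDATE student s
SET course_count = (
  SELECT COUNT(*) FROM enrollment e WHERE e.student_id = s.student_id
);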

Data Types and Constraints

SQL offers many data types; among the most common are integers (INT), variable-length strings (VARCHAR), dates (DATE), and booleans (BOOLEAN). Choosing the right data type is important when you create a database.

  • Use INT for IDs
  • Use VARCHAR(n) for names
  • Use DATE for dates
  • Prefer BOOLEAN over 0/1 integer flags for true/false values

You can also add constraints to enforce rules. Important constraints include:

  • NOT NULL: the value must be present
  • UNIQUE: no two rows may share the value
  • CHECK: the value must satisfy a condition
  • DEFAULT: a value to use when none is supplied

Example:

CREATE TABLE survey (
  id SERIAL PRIMARY KEY,
  location TEXT NOT NULL,                          -- NOT NULL: a location is required
  vegetation_density INTEGER
    CHECK (vegetation_density BETWEEN 0 AND 100)   -- CHECK: keep values in range
);
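The example above covers NOT NULL and CHECK. A second, hypothetical table shows UNIQUE and DEFAULT, along with the DATE and BOOLEAN types:

CREATE TABLE observer (
  id SERIAL PRIMARY KEY,
  email VARCHAR(254) NOT NULL UNIQUE,  -- UNIQUE: no two observers share an email
  joined_on DATE NOT NULL,             -- DATE for calendar dates
  is_active BOOLEAN DEFAULT TRUE       -- DEFAULT fills the value when none is given
);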

Designing for Data Collection:

Before building your database, you need to plan for the following:

  • Granularity: What does one row represent?
  • Consistency: Use controlled vocabularies
  • Validation: Use dropdowns and constraints to prevent errors
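
For example, a controlled vocabulary can be enforced directly in the schema. A sketch, assuming hypothetical habitat categories for the survey table above:

ALTER TABLE survey
  ADD COLUMN habitat TEXT
  CHECK (habitat IN ('forest', 'wetland', 'grassland'));  -- only these values are accepted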

Data Preparation: Clean Data In = Clean Results Out

Before data hits your database, you will need to:

  • Clean: remove duplicates, fix typos
  • Validate: check ranges, dates, formats
  • Standardize: format names, units, values consistently
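
Here is a PostgreSQL-style sketch of the first two steps, assuming a staging table named survey_raw with hypothetical columns:

-- Clean: keep one physical row per (location, observed_on) pair
DELETE FROM survey_raw a
USING survey_raw b
WHERE a.ctid > b.ctid                  -- ctid is PostgreSQL's physical row ID
  AND a.location = b.location
  AND a.observed_on = b.observed_on;

-- Validate: surface rows whose values fall outside the expected range
SELECT *
FROM survey_raw
WHERE vegetation_density NOT BETWEEN 0 AND 100
   OR observed_on > CURRENT_DATE;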

You can use tools like the following:

  • SQL scripts for transformation
  • Python (pandas) for preprocessing
  • R (dplyr) for preprocessing
  • ETL pipelines such as dbt, Airflow, or Talend
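
For instance, a small SQL standardization pass might look like this (survey_raw and its location column are from the earlier sketch; the fix itself is illustrative):

-- Standardize: strip stray whitespace and normalize the casing of place names
UPDATE survey_raw
SET location = INITCAP(TRIM(location));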

Good modeling and clean data give you a solid foundation. In the next chapter, we'll explore parallel computing and data pipelines: scaling up how your data is stored, moved, and analyzed.