Chapter 7: Data Modeling and Preparation
Databases aren't just containers for data; they're structured systems. A good data model ensures:
- Consistency: no conflicting or redundant data
- Efficiency: fewer joins, faster queries
- Scalability: easier to expand or change
- Integrity: fewer bugs and edge cases
Whether you're building a student tracking system, a biodiversity survey database, or a business dashboard, modeling your data upfront saves time downstream.
Types of Data Models:
- Conceptual Model: high-level and business-focused; defines entities and relationships. Example: Students enroll in Courses.
- Logical Model: focused on structure, independent of any DBMS; includes attributes and primary/foreign keys. Example: Student(student_id, name).
- Physical Model: implementation-ready; includes datatypes, indexes, and storage engines (sketched below).
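For instance, a minimal sketch of turning the logical Student(student_id, name) into a physical model; the datatypes and index here are illustrative choices, not prescriptions:

CREATE TABLE student (
    student_id INT PRIMARY KEY,   -- concrete datatype and key chosen at the physical level
    name VARCHAR(100) NOT NULL
);
CREATE INDEX idx_student_name ON student (name);  -- indexes are a physical-model decision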
It’s important to understand what entities, attributes, and relationships are.
- Entities = Things (e.g., Students, Courses, Surveys)
- Attributes = Properties (e.g., name, date, score)
- Relationships = Connections (e.g., Students take Courses)
Use Entity-Relationship Diagrams (ERDs) to visualize how tables relate. Two key elements, illustrated in the sketch after this list, are:
- Primary Key (PK): uniquely identifies a record
- Foreign Key (FK): references a PK in another table
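A minimal sketch, extending the student table above with hypothetical course and enrollment tables:

CREATE TABLE course (
    course_id INT PRIMARY KEY,   -- PK: uniquely identifies each course
    title VARCHAR(200) NOT NULL
);
CREATE TABLE enrollment (
    student_id INT REFERENCES student (student_id),  -- FK: points to the student PK
    course_id INT REFERENCES course (course_id),     -- FK: points to the course PK
    PRIMARY KEY (student_id, course_id)              -- composite PK: one row per pairing
);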
Normalization: Reducing Redundancy
Normalization organizes data to reduce duplication. Normalized tables are easier to maintain, improve data integrity, and reduce update anomalies.
Example: Unnormalized vs Normalized Design
Unnormalized: Student(name, major, course1, course2, course3)
Normalized:
- Student(student_id, name, major_id)
- Course(course_id, title)
- Enrollment(student_id, course_id)
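With primary and foreign keys in place (as in the earlier sketches), a join reassembles the flattened view on demand; for example:

SELECT s.name, c.title
FROM student s
JOIN enrollment e ON e.student_id = s.student_id
JOIN course c ON c.course_id = e.course_id;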
It is also important to know when to denormalize: normalization is great for transactional systems, but analytics workloads may require denormalization for performance.
Denormalization examples:
- Storing course_count in the student table
- Combining location and timestamp into one wide survey row instead of separate joined tables
Denormalization pays off when read speed matters more than write speed, when you control how often the data is refreshed, or when joins are slowing down critical queries. The sketch below shows the course_count case.
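As a hedged sketch, course_count could be precomputed and refreshed on whatever schedule you control; table and column names follow the earlier examples:

ALTER TABLE student ADD COLUMN course_count INT DEFAULT 0;

UPDATE student s
SET course_count = (
    SELECT COUNT(*) FROM enrollment e WHERE e.student_id = s.student_id  -- recompute per student
);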
Data Types and Constraints
Choosing the right data types is important when you create a database. Common types include integers (INT), variable-length strings (VARCHAR), dates (DATE), and booleans (BOOLEAN, often stored as 0 or 1 in systems without a native type). As a rule of thumb:
- INT for IDs
- VARCHAR(n) for names
- DATE for dates
- BOOLEAN for true/false values
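A hypothetical table mapping each guideline to a column:

CREATE TABLE site_visit (
    id INT PRIMARY KEY,       -- INT for IDs
    site_name VARCHAR(100),   -- VARCHAR(n) for names
    visited_on DATE,          -- DATE for dates
    is_verified BOOLEAN       -- BOOLEAN for true/false
);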
You can also add constraints to enforce rules. Important constraints include:
- NOT NULL
- UNIQUE
- CHECK
- DEFAULT
Example:
CREATE TABLE survey (
    id SERIAL PRIMARY KEY,
    location TEXT NOT NULL,    -- NOT NULL: every survey needs a location
    vegetation_density INTEGER
        CHECK (vegetation_density BETWEEN 0 AND 100)  -- CHECK: keeps values in range
);
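UNIQUE and DEFAULT work the same way; a sketch with a hypothetical observer table:

CREATE TABLE observer (
    observer_id SERIAL PRIMARY KEY,
    email TEXT UNIQUE NOT NULL,          -- UNIQUE: no two observers may share an email
    joined_on DATE DEFAULT CURRENT_DATE  -- DEFAULT: filled in when the insert omits it
);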
Designing for Data Collection:
Before building your database, you need to plan for the following:
- Granularity: What does one row represent?
- Consistency: Use controlled vocabularies (see the sketch after this list)
- Validation: Use dropdowns and constraints to prevent errors
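One way to enforce a controlled vocabulary is a lookup table that valid rows must reference; a sketch with hypothetical names:

CREATE TABLE habitat_type (
    code TEXT PRIMARY KEY   -- one row per allowed vocabulary term
);
INSERT INTO habitat_type (code) VALUES ('forest'), ('wetland'), ('grassland');

CREATE TABLE observation (
    id SERIAL PRIMARY KEY,
    habitat TEXT NOT NULL REFERENCES habitat_type (code)  -- rejects values outside the vocabulary
);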
Data Preparation: Clean Data In = Clean Results Out
Before data hits your database, you will need to do the following (each step is sketched below):
- Clean: remove duplicates, fix typos
- Validate: check ranges, dates, formats
- Standardize: format names, units, values consistently
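A sketch of each step in SQL, assuming a hypothetical raw_survey staging table with id, location, observed_on, and vegetation_density columns; the duplicate-removal syntax is PostgreSQL-specific:

-- Standardize: trim whitespace and lowercase location names
UPDATE raw_survey SET location = LOWER(TRIM(location));

-- Validate: surface rows with out-of-range values before loading
SELECT * FROM raw_survey WHERE vegetation_density NOT BETWEEN 0 AND 100;

-- Clean: delete exact duplicates, keeping the lowest id
DELETE FROM raw_survey a
USING raw_survey b
WHERE a.id > b.id
  AND a.location = b.location
  AND a.observed_on = b.observed_on;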
You can use tools like the following:
- SQL scripts for transformation
- Python (pandas) or R (dplyr) for preprocessing
- ETL pipeline tools: dbt, Airflow, Talend
Good modeling and clean data give you a solid foundation. In the next chapter, we'll explore parallel computing and data pipelines, scaling up how your data is stored, moved, and analyzed.