Chapter 7: Data Modeling and Preparation
Databases aren't just containers for data; they're structured systems. A good data model ensures:
- Consistency: no conflicting or redundant data
- Efficiency: fewer joins, faster queries
- Scalability: easier to expand or change
- Integrity: fewer bugs and edge cases
Whether you're building a student tracking system, a biodiversity survey database, or a business dashboard, modeling your data upfront saves time downstream.
Types of Data Models:
- Conceptual Model: high-level and business-focused; defines entities and relationships. Example: Students enroll in Courses.
- Logical Model: focused on structure, independent of any DBMS; includes attributes and primary/foreign keys. Example: Student(student_id, name).
- Physical Model: implementation-ready; includes datatypes, indexes, and storage engines (sketched below).
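For instance, a minimal sketch of turning the logical Student(student_id, name) into a physical model; the datatypes and index here are illustrative choices, not prescriptions:

CREATE TABLE student (
    student_id INT PRIMARY KEY,   -- concrete datatype and key chosen at the physical level
    name VARCHAR(100) NOT NULL
);
CREATE INDEX idx_student_name ON student (name);  -- indexes are a physical-model decision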
It’s important to understand what entities, attributes, and relationships are.
- Entities = Things (e.g., Students, Courses, Surveys)
- Attributes = Properties (e.g., name, date, score)
- Relationships = Connections (e.g., Students take Courses)
Use Entity-Relationship Diagrams (ERDs) to visualize how tables relate. Two key elements, illustrated in the sketch after this list, are:
- Primary Key (PK): uniquely identifies a record
- Foreign Key (FK): references a PK in another table
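A minimal sketch, extending the student table above with hypothetical course and enrollment tables:

CREATE TABLE course (
    course_id INT PRIMARY KEY,   -- PK: uniquely identifies each course
    title VARCHAR(200) NOT NULL
);
CREATE TABLE enrollment (
    student_id INT REFERENCES student (student_id),  -- FK: points to the student PK
    course_id INT REFERENCES course (course_id),     -- FK: points to the course PK
    PRIMARY KEY (student_id, course_id)              -- composite PK: one row per pairing
);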
Normalization: Reducing Redundancy
Normalization organizes data to reduce duplication. Normalized tables are easier to maintain, improve data integrity, and reduce update anomalies.
Example: Unnormalized vs Normalized Design
Unnormalized: Student(name, major, course1, course2, course3)
Normalized:
- Student(student_id, name, major_id)
- Course(course_id, title)
- Enrollment(student_id, course_id)
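With primary and foreign keys in place (as in the earlier sketches), a join reassembles the flattened view on demand; for example:

SELECT s.name, c.title
FROM student s
JOIN enrollment e ON e.student_id = s.student_id
JOIN course c ON c.course_id = e.course_id;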
It is also important to know when to denormalize: normalization is great for transactional systems, but analytics workloads may require denormalization for performance.
Denormalization examples:
- Storing course_count in the student table
- Combining location and timestamp into one wide survey row instead of separate joined tables
Denormalization pays off when read speed matters more than write speed, when you control how often the data is refreshed, or when joins are slowing down critical queries. The sketch below shows the course_count case.
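As a hedged sketch, course_count could be precomputed and refreshed on whatever schedule you control; table and column names follow the earlier examples:

ALTER TABLE student ADD COLUMN course_count INT DEFAULT 0;

UPDATE student s
SET course_count = (
    SELECT COUNT(*) FROM enrollment e WHERE e.student_id = s.student_id  -- recompute per student
);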
Data Types and Constraints
Choosing the right data types is important when you create a database. Common types include integers (INT), variable-length strings (VARCHAR), dates (DATE), and booleans (BOOLEAN, often stored as 0 or 1 in systems without a native type). As a rule of thumb:
- INT for IDs
- VARCHAR(n) for names
- DATE for dates
- BOOLEAN for true/false values
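A hypothetical table mapping each guideline to a column:

CREATE TABLE site_visit (
    id INT PRIMARY KEY,       -- INT for IDs
    site_name VARCHAR(100),   -- VARCHAR(n) for names
    visited_on DATE,          -- DATE for dates
    is_verified BOOLEAN       -- BOOLEAN for true/false
);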
You can also add constraints to enforce rules. Important constraints include:
- NOT NULL
- UNIQUE
- CHECK
- DEFAULT
Example:
CREATE TABLE survey (
    id SERIAL PRIMARY KEY,
    location TEXT NOT NULL,    -- NOT NULL: every survey needs a location
    vegetation_density INTEGER
        CHECK (vegetation_density BETWEEN 0 AND 100)  -- CHECK: keeps values in range
);
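UNIQUE and DEFAULT work the same way; a sketch with a hypothetical observer table:

CREATE TABLE observer (
    observer_id SERIAL PRIMARY KEY,
    email TEXT UNIQUE NOT NULL,          -- UNIQUE: no two observers may share an email
    joined_on DATE DEFAULT CURRENT_DATE  -- DEFAULT: filled in when the insert omits it
);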
Designing for Data Collection:
Before building your database, you need to plan for the following:
- Granularity: What does one row represent?
- Consistency: Use controlled vocabularies (see the sketch after this list)
- Validation: Use dropdowns and constraints to prevent errors
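One way to enforce a controlled vocabulary is a lookup table that valid rows must reference; a sketch with hypothetical names:

CREATE TABLE habitat_type (
    code TEXT PRIMARY KEY   -- one row per allowed vocabulary term
);
INSERT INTO habitat_type (code) VALUES ('forest'), ('wetland'), ('grassland');

CREATE TABLE observation (
    id SERIAL PRIMARY KEY,
    habitat TEXT NOT NULL REFERENCES habitat_type (code)  -- rejects values outside the vocabulary
);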
Data Preparation: Clean Data In = Clean Results Out
Before data hits your database, you will need to do the following (each step is sketched below):
- Clean: remove duplicates, fix typos
- Validate: check ranges, dates, formats
- Standardize: format names, units, values consistently
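A sketch of each step in SQL, assuming a hypothetical raw_survey staging table with id, location, observed_on, and vegetation_density columns; the duplicate-removal syntax is PostgreSQL-specific:

-- Standardize: trim whitespace and lowercase location names
UPDATE raw_survey SET location = LOWER(TRIM(location));

-- Validate: surface rows with out-of-range values before loading
SELECT * FROM raw_survey WHERE vegetation_density NOT BETWEEN 0 AND 100;

-- Clean: delete exact duplicates, keeping the lowest id
DELETE FROM raw_survey a
USING raw_survey b
WHERE a.id > b.id
  AND a.location = b.location
  AND a.observed_on = b.observed_on;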
You can use tools like the following:
- SQL scripts for transformation
- Python (pandas) or R (dplyr) for preprocessing
- ETL pipeline tools: dbt, Airflow, Talend
Good modeling and clean data give you a solid foundation. In the next chapter, we'll explore parallel computing and data pipelines, scaling up how your data is stored, moved, and analyzed.