03: Join the DISTINCT with SQL

Links & Self-Guided Review

Outline

  • Why SQL still runs the data world
  • SQL in notebooks (JupySQL + DuckDB/SQLite)
  • SQL basics: comments, NULLs, SELECT/WHERE
  • Live demo 1
  • SQL statement structure and execution order
  • Importing data and defining tables
  • Filtering and aggregation (WHERE vs HAVING)
  • Joins, subqueries, CTEs, window functions
  • Views, performance basics, SQL + Python workflows
  • Live demo 2
  • Live demo 3

Why SQL still runs the data world

SQL is how data teams ask precise questions of large relational datasets. In health data science, it shows up everywhere: EHR extracts, claims data, public health reporting, and analytics layers under dashboards.

Where SQL shows up What you use it for
EHR reporting Cohorts, encounters, lab results
Claims data Cost summaries, utilization, billing audits
Research datasets Cleaning, joining, aggregating
Analytics stacks ELT pipelines (extract → load → transform), dashboards, scheduled reports

XKCD: Data Point

Reference Card: Why SQL matters

Strength Why it helps in practice
Declarative You state what you want, not how to compute it
Scalable Works on datasets larger than laptop RAM
Transferable Similar syntax across DuckDB, SQLite, PostgreSQL
Composable Queries can be layered and reused

Code Snippet: Smallest useful query

SELECT patient_id, age, sex
FROM demographics
LIMIT 5;

SQL in notebooks (JupySQL + DuckDB/SQLite)

DuckDB is an embedded analytics database that feels like SQLite but is optimized for analytics. JupySQL lets you run SQL directly in notebooks using %sql for single-line and %%sql for multi-line queries, and it works with any SQLAlchemy-supported engine (SQLAlchemy is a Python toolkit for database connections).

JupySQL magics

Use %sql for quick, single-line queries and %%sql for multi-line blocks you want to read like a full query.

Reference Card: JupySQL magics

Magic Use case Example
%sql Single-line query %sql SELECT COUNT(*) FROM labs
%%sql Multi-line query %%sql (cell header)

Code Snippet: One-line query

%sql SELECT COUNT(*) FROM labs

DuckDB and SQLite connections

Both are lightweight, file-based databases, so connection strings point to a local file and keep everything reproducible.

Reference Card: Local connections

Engine Connection string Notes
DuckDB duckdb:///clinic.db Fast analytics on local files
SQLite sqlite:///clinic.db Portable file-based database

Code Snippet: Connect to DuckDB

%sql duckdb:///clinic.db
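
The SQLite connection from the reference card works the same way:

Code Snippet: Connect to SQLite

%sql sqlite:///clinic.db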

DuckDB + SQLite attachments

DuckDB can attach a SQLite database directly so you can query it without conversion. This is DuckDB-specific.

Reference Card: DuckDB SQLite attach

Step Purpose Example
Install scanner Load SQLite extension INSTALL sqlite_scanner
Attach DB Connect file ATTACH 'chinook.sqlite' AS chinook (TYPE SQLITE)
List tables Inspect schema SHOW ALL TABLES

Code Snippet: Attach SQLite in DuckDB

INSTALL sqlite_scanner;
LOAD sqlite_scanner;
ATTACH 'chinook.sqlite' AS chinook (TYPE SQLITE);
SHOW ALL TABLES;  -- lists tables across attached databases

SQL to pandas flow

Reference Card: Notebook setup

Step Command
Install pip install duckdb duckdb-engine jupysql pandas
Load magic %load_ext sql
Connect (DuckDB) %sql duckdb:///clinic.duckdb
Connect (SQLite) %sql sqlite:///clinic.sqlite
Return DataFrame result = %sql SELECT * FROM table
Capture to variable %%sql result << SELECT * FROM table

Code Snippet: Minimal setup

# Cell 1: load the SQL magic and connect
import duckdb

%load_ext sql
%sql duckdb:///clinic.db

%%sql
-- Cell 2: %%sql must be the first line of its own cell
SELECT * FROM demographics
LIMIT 5;

Code Snippet: Return DataFrame (assignment)

result = %sql SELECT * FROM demographics LIMIT 5
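
# By default JupySQL returns a ResultSet; convert it explicitly
# (or set: %config SqlMagic.autopandas = True)
df = result.DataFrame()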

Code Snippet: Return DataFrame (variable capture)

%%sql invoice_counts <<
SELECT BillingCountry, COUNT(*) AS invoice_count
FROM invoices
GROUP BY BillingCountry
ORDER BY invoice_count DESC

Querying DataFrames with SQL

DuckDB can query pandas DataFrames directly using replacement scans: use the DataFrame variable name in your SQL query. Replacement scans must be enabled when using SQLAlchemy connections (which JupySQL uses with connection strings).

Reference Card: DataFrame queries

Pattern Usage
Enable scans (required) %sql SET python_scan_all_frames=true
Query DataFrame %sql SELECT * FROM df WHERE x >= 5
Capture result %%sql result << SELECT * FROM df

Code Snippet: Query a DataFrame

# Assume we already have duckdb connected and pandas imported
# Enable DataFrame replacement scans (required for SQLAlchemy connections)
%sql SET python_scan_all_frames=true

# Create a DataFrame
df = pd.DataFrame({
    'patient_id': [1, 2, 3, 4, 5],
    'age': [25, 30, 35, 40, 45],
    'department': ['Cardio', 'Neuro', 'Cardio', 'Neuro', 'Cardio']
})

# Query it with SQL (now works because replacement scans are enabled)
%sql SELECT department, COUNT(*) AS count FROM df GROUP BY department

SQL basics

These are the building blocks of most queries. Treat them as a checklist when reading or writing SQL.

Comments, semicolons, and NULLs

SQL statements end with semicolons, and comments start with --. NULL means missing, so comparisons need IS NULL rather than =.

Value Interpretation Correct check
NULL Missing value IS NULL / IS NOT NULL
Empty string Present but blank = ''
Zero Numeric value = 0
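
Comment and statement syntax in one place, a minimal sketch:

Code Snippet: Comments and semicolons

-- Single-line comments start with two dashes
SELECT patient_id, lab_value   -- trailing comments work too
FROM labs;   -- the semicolon ends the statement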

Reference Card: NULL handling

Pattern Example Notes
Check missing WHERE lab_value IS NULL NULL is not equal to NULL
Fill missing COALESCE(lab_value, 0) Replace NULL with default
Safer filters WHERE lab_value IS NOT NULL Avoid accidental drops

Code Snippet: NULL safe filtering

SELECT patient_id, test_name, value
FROM labs
WHERE value IS NOT NULL
ORDER BY test_name;

COALESCE

Use COALESCE to replace missing values with a default during queries.

Reference Card: COALESCE

Function Purpose Example
COALESCE Fill NULLs COALESCE(lab_value, 0)

Code Snippet: Fill NULLs

SELECT patient_id, COALESCE(lab_value, 0) AS lab_value_filled
FROM labs;

SELECT, WHERE, ORDER BY, LIMIT, DISTINCT

This is the core of most queries: choose columns, filter rows, and order the output. Learn these first and most day-to-day SQL becomes readable.

patient_id encounter_date department total_cost
2009 2024-03-06 Cardiology 560.00
2006 2024-03-01 Emergency 980.00
2010 2024-02-28 Emergency 1500.00

Reference Card: Core clauses

Clause Purpose Example
SELECT Choose columns SELECT patient_id, age
WHERE Filter rows WHERE age >= 18
ORDER BY Sort results ORDER BY total_cost DESC
LIMIT Keep top N LIMIT 10
DISTINCT Unique values SELECT DISTINCT department

Reference Card: WHERE operators

Operator Purpose Example
Comparisons Match values age >= 18
Logical Combine conditions age >= 18 AND sex = 'F'
IN Match a set department IN ('ER', 'ICU')
BETWEEN Match ranges total_cost BETWEEN 100 AND 500
LIKE Pattern match test_name LIKE '%A1C%'

Code Snippet: Basic filtering

SELECT encounter_id, patient_id, department, total_cost
FROM encounters
WHERE total_cost >= 500
ORDER BY total_cost DESC
LIMIT 5;

CASE WHEN

Use CASE WHEN to create conditional categories inside a query.

Reference Card: CASE WHEN

Pattern Purpose Example
CASE WHEN ... THEN ... END Conditional logic CASE WHEN age >= 65 THEN 'senior' ELSE 'adult' END

Code Snippet: Conditional labels

SELECT patient_id,
    CASE
        WHEN age >= 65 THEN 'senior'
        WHEN age >= 18 THEN 'adult'
        ELSE 'minor'
    END AS age_group
FROM demographics;

COUNT(DISTINCT ...)

Use COUNT(DISTINCT ...) to count unique values instead of rows.

Reference Card: COUNT DISTINCT

Pattern Purpose Example
COUNT(DISTINCT col) Unique count COUNT(DISTINCT patient_id)

Code Snippet: Unique patients per department

SELECT department, COUNT(DISTINCT patient_id) AS unique_patients
FROM encounters
GROUP BY department;

DISTINCT vs GROUP BY

Use DISTINCT for unique row combinations and GROUP BY when you need aggregates.

Reference Card: DISTINCT vs GROUP BY

Tool Best for Example
DISTINCT Unique rows SELECT DISTINCT department FROM encounters
GROUP BY Aggregation SELECT department, COUNT(*) FROM encounters GROUP BY department

Code Snippet: Unique departments

SELECT DISTINCT department
FROM encounters
ORDER BY department;
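
The GROUP BY counterpart from the reference card returns one row per department plus a count:

Code Snippet: Departments with counts

SELECT department, COUNT(*) AS visit_count
FROM encounters
GROUP BY department
ORDER BY department;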

IN vs EXISTS

Use IN for small sets and EXISTS for correlated subqueries.

Reference Card: IN vs EXISTS

Tool Best for Example
IN Small list or subquery WHERE id IN (SELECT ...)
EXISTS Correlated checks WHERE EXISTS (SELECT 1 ...)

Code Snippet: EXISTS pattern

SELECT d.patient_id, d.age
FROM demographics AS d
WHERE EXISTS (
    SELECT 1
    FROM labs AS l
    WHERE l.patient_id = d.patient_id
);
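
The same cohort with IN, which reads well when the subquery is small and uncorrelated:

Code Snippet: IN pattern

SELECT patient_id, age
FROM demographics
WHERE patient_id IN (
    SELECT patient_id
    FROM labs
);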

LIKE vs ILIKE

LIKE is case-sensitive in most engines, while ILIKE (supported by DuckDB and PostgreSQL) matches case-insensitively. Note that SQLite's LIKE is case-insensitive for ASCII characters by default.

Reference Card: LIKE vs ILIKE

Tool Purpose Example
LIKE Case-sensitive match test_name LIKE '%A1C%'
ILIKE Case-insensitive match test_name ILIKE '%a1c%'

Code Snippet: Pattern matching

SELECT patient_id, test_name
FROM labs
WHERE test_name ILIKE '%a1c%';

LIVE DEMO!

Importing data and defining tables

Real projects start with files. You can load CSVs directly, scan Parquet for columnar files (columnar means values are stored by column for faster analytics), or create tables with explicit types for safer analysis.

flowchart LR
    A[Raw CSV/Parquet files] --> B[Ingest (read_csv_auto / read_parquet)]
    B --> C[Typed table]
    C --> D[Query and reuse]

DuckDB read_csv_auto

read_csv_auto is DuckDB-specific and auto-detects column types for quick exploration.

Reference Card: read_csv_auto (DuckDB)

Feature Benefit Example
Auto schema Fast exploration SELECT * FROM read_csv_auto('file.csv')

Code Snippet: Quick CSV scan

SELECT * FROM read_csv_auto('demographics.csv');

DuckDB read_parquet

read_parquet is DuckDB-specific and reads columnar Parquet files efficiently.

Reference Card: read_parquet (DuckDB)

Feature Benefit Example
Columnar reads Faster analytics SELECT * FROM read_parquet('file.parquet')

Code Snippet: Parquet scan

SELECT * FROM read_parquet('encounters.parquet');

COPY for bulk load

Use COPY when you need to ingest a large file quickly into a table.

Reference Card: COPY

Option Purpose Example
HEADER Skip header row HEADER true
DELIMITER Customize separator DELIMITER ','

Code Snippet: COPY from CSV

COPY demographics
FROM 'demographics.csv'
(FORMAT CSV, HEADER true);

COPY for exports

Use COPY ... TO when you want to write query results to CSV or Parquet.

Reference Card: COPY TO

Format Example Notes
CSV COPY (...) TO 'file.csv' (HEADER) Easy to inspect
Parquet COPY (...) TO 'file.parquet' (FORMAT PARQUET) Efficient analytics

Code Snippet: Export to CSV and Parquet

COPY (SELECT * FROM demographics LIMIT 100)
TO 'demographics_sample.csv' (HEADER, DELIMITER ',');

COPY (SELECT * FROM demographics LIMIT 100)
TO 'demographics_sample.parquet' (FORMAT PARQUET);

CREATE TABLE then COPY

Create a table with explicit types, then bulk-load into it, when you want a schema you control.

Reference Card: CREATE TABLE then COPY

Benefit Use case
Explicit types Safer analysis
Repeatable load Reproducible pipelines

Code Snippet: Create table + load

CREATE TABLE demographics (
    patient_id INTEGER,
    age INTEGER,
    sex TEXT
);
COPY demographics
FROM 'demographics.csv'
(FORMAT CSV, HEADER true);

Reference Card: Common import patterns

Pattern Use case Example
read_csv_auto Quick CSV exploration SELECT * FROM read_csv_auto('file.csv')
read_parquet Fast columnar reads SELECT * FROM read_parquet('file.parquet')
CREATE TABLE AS Persist results CREATE TABLE t AS SELECT ...
COPY Fast bulk load COPY t FROM 'file.csv' (HEADER true)

Code Snippet: Create table from CSV

CREATE TABLE demographics AS
SELECT * FROM read_csv_auto('demographics.csv');

Filtering and aggregation

Filtering shrinks the data to what you need. Aggregation summarizes it to the level you want to report. Use WHERE to filter rows before grouping and HAVING to filter groups after aggregation; if you are unsure, start with WHERE.

WHERE vs HAVING

WHERE filters raw rows, and HAVING filters aggregated groups after GROUP BY.

Reference Card: WHERE vs HAVING

Clause Applies to Example
WHERE Rows before grouping WHERE total_cost > 500
HAVING Groups after GROUP BY HAVING COUNT(*) > 3
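
Both clauses in one query, a minimal sketch over the encounters table used throughout:

Code Snippet: WHERE then HAVING

SELECT department, COUNT(*) AS visit_count
FROM encounters
WHERE total_cost > 500       -- rows filtered before grouping
GROUP BY department
HAVING COUNT(*) > 3;         -- groups filtered after aggregation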

GROUP BY and aggregates

Aggregates like COUNT and AVG summarize each group into a single row of results.

Reference Card: Aggregates

Function Purpose Example
COUNT Row counts COUNT(*)
AVG Mean AVG(total_cost)
SUM Totals SUM(total_cost)

flowchart LR
    A[Raw rows] --> B[WHERE filters rows]
    B --> C[GROUP BY aggregates]
    C --> D[HAVING filters groups]

Department Visit count Avg cost
Cardiology 7 812.50
Oncology 5 1230.20
Primary Care 12 210.30

Reference Card: Filtering + aggregates

Tool Example Purpose
GROUP BY GROUP BY department Define groups
HAVING HAVING COUNT(*) > 3 Filter groups
Aggregates COUNT, AVG, SUM Summarize

Code Snippet: Filter then group

SELECT department,
    COUNT(*) AS visit_count,
    AVG(total_cost) AS avg_cost
FROM encounters
WHERE encounter_date >= '2024-01-01'
GROUP BY department
ORDER BY avg_cost DESC;

ROUND

Use ROUND to tidy numeric summaries for reporting after you aggregate.

Reference Card: ROUND

Function Purpose Example
ROUND Set decimal places ROUND(AVG(total_cost), 2)

Code Snippet: Rounded averages

SELECT department, ROUND(AVG(total_cost), 2) AS avg_cost
FROM encounters
GROUP BY department;

XKCD: Selection Bias

Joins: combine tables safely

Joins connect demographics, encounters, and lab results. Always join on keys and check for row explosion (many-to-many joins can multiply rows).

Joins

Plan joins deliberately: identify the join key, check it for duplicates, and compare row counts before and after the join.

Reference Card: Join checks

Check Why it matters Quick test
Key uniqueness Prevent row explosion COUNT(*) vs COUNT(DISTINCT key)
Row counts Detect unexpected growth Compare counts pre/post join

Code Snippet: Pre-join key check

SELECT COUNT(*) AS rows, COUNT(DISTINCT patient_id) AS patients
FROM encounters;

Join types

Join type determines which rows survive when keys are missing or unmatched.

Reference Card: Join types

Join Keeps rows from When to use
INNER JOIN Both tables Only matched records
LEFT JOIN Left table Keep all patients, add matches
RIGHT JOIN Right table Rare in analytics
FULL JOIN Both tables Audits, reconciliation

Code Snippet: Encounter + demographics

SELECT e.encounter_id, e.department, d.age, d.sex
FROM encounters AS e
LEFT JOIN demographics AS d
    ON e.patient_id = d.patient_id;

ON vs USING

Use ON when columns differ in name or you need a more complex condition. Use USING when column names match and you want a cleaner query.

Reference Card: ON vs USING

Syntax Best for Example
ON Different column names or conditions ON e.patient_id = d.patient_id
USING Same column name in both tables USING (patient_id)

Code Snippet: USING clause

SELECT *
FROM encounters
INNER JOIN demographics
USING (patient_id);

flowchart LR
    A[encounters.patient_id] -->|ON e.patient_id = d.patient_id| C[Join]
    B[demographics.patient_id] -->|USING (patient_id)| C

UNION ALL

Use UNION ALL to stack compatible result sets without de-duplicating rows.

Reference Card: UNION ALL

Operator Purpose Example
UNION ALL Stack results SELECT ... UNION ALL SELECT ...

Code Snippet: Stack row counts

SELECT 'encounters' AS table_name, COUNT(*) AS rows FROM encounters
UNION ALL
SELECT 'labs', COUNT(*) FROM labs
ORDER BY table_name;

SQL statement structure and execution order

SQL reads like English, but it executes in a specific order. When a query is confusing, rewrite it as a pipeline: join tables, filter rows, group, compute, then sort.

Statement structure

Write SQL in the order you want it to read: SELECT what you want, FROM where it lives, and WHERE how you filter.

Reference Card: Readable structure

Step Purpose Example
SELECT Choose columns SELECT patient_id, age
FROM Pick table FROM demographics
WHERE Filter rows WHERE age >= 18

Code Snippet: Skeleton query

SELECT patient_id, age
FROM demographics
WHERE age >= 18;

Execution order

Execution proceeds from FROM/JOIN to WHERE, then GROUP BY and HAVING, and only then SELECT and ORDER BY; this is why column aliases defined in SELECT are not available in WHERE.

SQL execution order

Read a query top to bottom in this order:

  1. Build tables with FROM + JOIN
  2. Filter rows with WHERE
  3. Aggregate with GROUP BY and filter groups with HAVING
  4. Compute output with SELECT
  5. Sort and trim with ORDER BY + LIMIT

Reference Card: Execution order (simplified)

Order Clause
1 FROM + JOIN
2 WHERE
3 GROUP BY
4 HAVING
5 SELECT
6 ORDER BY
7 LIMIT

Code Snippet: Readable query template

SELECT department, COUNT(*) AS visit_count
FROM encounters
WHERE total_cost > 500
GROUP BY department
HAVING COUNT(*) >= 3
ORDER BY visit_count DESC
LIMIT 5;

LIVE DEMO!!

Subqueries and CTEs

Subqueries let you nest logic. CTEs (WITH) make that logic readable and reusable, which matters for multi-step cohort definitions (a cohort is a defined group of patients).

Subqueries

Use a subquery when you need a compact, one-off filter or calculation inside a larger query.

Reference Card: Subqueries

Pattern Use case Example
IN (SELECT ...) Filter by a derived set WHERE id IN (SELECT ...)
Scalar subquery Single value in SELECT (SELECT MAX(...))

Code Snippet: Subquery filter

SELECT patient_id, age
FROM demographics
WHERE patient_id IN (
    SELECT patient_id
    FROM labs
    WHERE test_name = 'A1C' AND value >= 7.0
);
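
The scalar form from the reference card, a minimal sketch comparing each lab value to the overall mean:

Code Snippet: Scalar subquery

SELECT patient_id, test_name, value,
    (SELECT AVG(value) FROM labs) AS overall_avg
FROM labs;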

CTEs

Use a CTE when the logic has multiple steps or needs to stay readable across several joins or filters.

Reference Card: CTEs

Feature Benefit Example
WITH name AS (...) Reuse readable steps WITH cohort AS (...)

flowchart LR
    A[Main query] --> B[Inline subquery]
    A --> C[CTE: WITH cohort AS (...)]
    C --> D[Main query uses cohort]

Reference Card: Subqueries vs CTEs

Pattern Best for Example
Subquery Small, single use logic WHERE id IN (SELECT ...)
CTE Multi-step, readable logic WITH cohort AS (...)

Code Snippet: CTE for a cohort

WITH high_a1c AS (
    SELECT patient_id
    FROM labs
    WHERE test_name = 'A1C' AND value >= 7.0
)
SELECT d.patient_id, d.age, d.sex
FROM demographics AS d
INNER JOIN high_a1c AS h
    ON d.patient_id = h.patient_id;

Window functions

Window functions compute metrics across related rows without collapsing them. They are common in claims and encounter analytics (running totals, most recent visit, rank within a patient).

patient_id encounter_date total_cost running_cost
101 2024-01-10 220.00 220.00
101 2024-02-03 125.00 345.00
101 2024-03-02 410.00 755.00

Reference Card: Window function patterns

Function Use case Example
ROW_NUMBER Order rows ROW_NUMBER() OVER (...)
LAG Compare to previous LAG(total_cost)
SUM Running totals SUM(total_cost) OVER (...)

Code Snippet: Running total per patient

SELECT patient_id,
    encounter_date,
    total_cost,
    SUM(total_cost) OVER (
        PARTITION BY patient_id
        ORDER BY encounter_date
    ) AS running_cost
FROM encounters
ORDER BY patient_id, encounter_date;
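
The "most recent visit" pattern mentioned above, a minimal sketch using ROW_NUMBER from the reference card:

Code Snippet: Most recent encounter per patient

WITH ranked AS (
    SELECT patient_id, encounter_date, total_cost,
        ROW_NUMBER() OVER (
            PARTITION BY patient_id
            ORDER BY encounter_date DESC
        ) AS rn
    FROM encounters
)
SELECT patient_id, encounter_date, total_cost
FROM ranked
WHERE rn = 1;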

XKCD: Proxy Variable

Data modification (INSERT, UPDATE, DELETE)

These statements change data in place. Use them sparingly in analytics and favor scratch databases when you need to write.

flowchart LR
    A[SELECT preview] --> B[INSERT/UPDATE/DELETE]
    B --> C[Validate row counts]
    C --> D[Commit or rollback]

Reference Card: Data modification statements

Statement Use case Guardrail
INSERT Add rows Use explicit column lists
UPDATE Edit rows Always include a WHERE
DELETE Remove rows Preview with SELECT first
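
The commit-or-rollback step from the diagram, a minimal sketch; DuckDB and SQLite both support explicit transactions:

Code Snippet: Wrap changes in a transaction

BEGIN TRANSACTION;
UPDATE encounters
SET total_cost = total_cost * 0.9
WHERE department = 'Cardiology';
-- validate row counts here, then:
COMMIT;   -- or ROLLBACK; to undo the change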

INSERT

Use explicit column lists and treat inserts as append-only in analytics workflows.

Code Snippet: Safe insert

INSERT INTO solution (id, name)
VALUES (1, 'Ada Lovelace');

UPDATE

Always include a WHERE and check the affected rows before committing the change.

Reference Card: UPDATE

Guardrail Why it helps
WHERE required Prevents full-table edits
Preview first Confirms target rows

Code Snippet: Safe update

UPDATE encounters
SET total_cost = total_cost * 0.9
WHERE department = 'Cardiology';

DELETE

Preview the rows with a SELECT first, then delete with the same WHERE filter.

Reference Card: DELETE

Guardrail Why it helps
Preview with SELECT Validate target rows
Limit scope Avoid accidental full delete

Code Snippet: Targeted delete

DELETE FROM encounters
WHERE encounter_date < '2010-01-01';
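
A preview with the same WHERE clause before deleting, as the guardrails suggest:

Code Snippet: Preview before delete

SELECT COUNT(*) AS rows_to_delete
FROM encounters
WHERE encounter_date < '2010-01-01';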

CAST

Use CAST to convert types explicitly before comparisons or aggregations.

Reference Card: CAST

Function Purpose Example
CAST Convert types CAST(encounter_date AS DATE)

Code Snippet: Cast to DATE

SELECT CAST(encounter_date AS DATE) AS visit_date, COUNT(*) AS visits
FROM encounters
GROUP BY CAST(encounter_date AS DATE)
ORDER BY visit_date;

date_part

Use date_part to extract date pieces for grouping and filtering.

Reference Card: date_part

Function Purpose Example
date_part Extract date parts date_part('month', encounter_date)

Code Snippet: Group by month

SELECT date_part('month', encounter_date) AS month, COUNT(*) AS visits
FROM encounters
GROUP BY month
ORDER BY month;

DATE_TRUNC

Use DATE_TRUNC to bucket timestamps into consistent time windows.

Reference Card: DATE_TRUNC

Function Purpose Example
DATE_TRUNC Time buckets DATE_TRUNC('month', encounter_date)

Code Snippet: Monthly buckets

SELECT DATE_TRUNC('month', encounter_date) AS month, COUNT(*) AS visits
FROM encounters
GROUP BY month
ORDER BY month;

Views and materialized views

Views save a query definition. Materialized views store a cached snapshot of results, which can speed up dashboards at the cost of refresh time.

Views

Use views to keep a canonical definition of common queries without duplicating SQL logic.

Reference Card: Views

Property Notes
Stored data No
Refresh Always current

Materialized views

Use materialized views when you can afford scheduled refreshes in exchange for faster reads. PostgreSQL supports them; DuckDB and SQLite do not.

Reference Card: Materialized views

Property Notes
Stored data Yes
Refresh Manual or scheduled

flowchart LR
    A[Query definition] --> B[View (no stored data)]
    A --> C[Materialized view (stored snapshot)]
    C --> D[Refresh schedule]

Code Snippet: Create a view

CREATE VIEW high_cost_encounters AS
SELECT encounter_id, patient_id, total_cost
FROM encounters
WHERE total_cost >= 1000;

Code Snippet: Create a materialized view

-- PostgreSQL syntax; not supported in DuckDB or SQLite
CREATE MATERIALIZED VIEW high_cost_encounters_mv AS
SELECT encounter_id, patient_id, total_cost
FROM encounters
WHERE total_cost >= 1000;

Performance basics

Performance depends on reading less data and doing work once. A query plan is the database's step-by-step execution order; start with filters and clear joins before reaching for complex optimizations.

flowchart LR
    A[Query] --> B[EXPLAIN plan]
    B --> C[Find slow step]
    C --> D[Optimize: filter, index, join]

XKCD: Complexity Analysis

Reference Card: Performance checklist

Tip Why it helps
Filter early Less data to join and aggregate
Select only needed columns Reduces memory usage
Use indexes (when supported) Faster lookups
Inspect query plans Find slow steps

Code Snippet: Explain a query

EXPLAIN
SELECT department, COUNT(*)
FROM encounters
GROUP BY department;
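
The "use indexes" tip as code, a minimal sketch; both DuckDB and SQLite support CREATE INDEX, and the index name here is illustrative:

Code Snippet: Index a join key

CREATE INDEX idx_encounters_patient
ON encounters (patient_id);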

SQL + Python workflows

SQL is great for filtering and aggregation; Python is great for modeling and visualization. Use SQL to reduce the dataset, then load the result into pandas for analysis. DuckDB and SQLite are easy local options; for external databases, connect via SQLAlchemy and reuse the same query strings.

SQL for reduction

Use SQL to filter to cohorts, aggregate to reporting levels, and shrink the data before it hits memory.

Reference Card: Reduction patterns

Pattern Goal Example
Filter Keep cohort WHERE age >= 18
Aggregate Summarize GROUP BY department

Code Snippet: Reduce before pandas

SELECT department, COUNT(*) AS visit_count
FROM encounters
WHERE encounter_date >= '2024-01-01'
GROUP BY department;

Python for analysis

Load the reduced dataset into pandas for plots, modeling, and downstream feature engineering.

Reference Card: pandas entry points

Method When to use
pd.read_sql SQLAlchemy connections
fetch_df DuckDB result to DataFrame

Code Snippet: pandas from SQLAlchemy

report = pd.read_sql(query, engine)

SQLite connections

Use SQLite for lightweight, file-based workflows that still support standard SQL.

Reference Card: SQLite

Tool Usage
sqlite3.connect sqlite3.connect("clinic.db")
pd.read_sql_query pd.read_sql_query(query, conn)

Code Snippet: SQLite to pandas

import sqlite3
import pandas as pd

conn = sqlite3.connect("clinic.db")
query = "SELECT department, COUNT(*) AS visit_count FROM encounters GROUP BY department"
report = pd.read_sql_query(query, conn)

Step Tool Output
Query SQL Table of filtered rows
Analyze pandas DataFrame, plots
Model sklearn Features, predictions

Reference Card: pandas interop

Method Usage
pd.read_sql pd.read_sql(query, connection)
DuckDB to pandas duckdb.sql(query).df()
Save results df.to_parquet("outputs/report.parquet")

Code Snippet: SQLAlchemy connection

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/clinic")
query = "SELECT department, COUNT(*) AS visit_count FROM encounters GROUP BY department"
report = pd.read_sql(query, engine)

Code Snippet: SQL to DataFrame

import duckdb
import pandas as pd

con = duckdb.connect("clinic.db")
query = """
SELECT department, COUNT(*) AS visit_count
FROM encounters
GROUP BY department
"""
report = con.execute(query).fetch_df()
print(report.head())

LIVE DEMO!!!

Advanced SQL patterns (optional)

These patterns are useful in production analytics but are not required for the demos.

NULLIF for safe division

Use NULLIF to avoid divide-by-zero errors in ratios.

Reference Card: NULLIF

Function Purpose Example
NULLIF Return NULL on match NULLIF(denominator, 0)

Code Snippet: Safe rate

SELECT numerator / NULLIF(denominator, 0) AS rate
FROM metrics;

ORDER BY NULLS FIRST/LAST

Some dialects let you control where NULLs appear in sorted results.

Reference Card: NULL ordering

Clause Purpose Example
NULLS FIRST NULLs at top ORDER BY lab_value NULLS FIRST
NULLS LAST NULLs at end ORDER BY lab_value NULLS LAST

Code Snippet: Explicit NULL ordering

SELECT patient_id, lab_value
FROM labs
ORDER BY lab_value NULLS LAST;

LIMIT and OFFSET

Use pagination to fetch results in smaller chunks.

Reference Card: Pagination

Clause Purpose Example
LIMIT Cap rows LIMIT 50
OFFSET Skip rows OFFSET 100

Code Snippet: Page through results

SELECT encounter_id, patient_id
FROM encounters
ORDER BY encounter_id
LIMIT 50 OFFSET 100;

INSERT ... SELECT

Use query results to populate another table.

Reference Card: INSERT SELECT

Pattern Purpose Example
INSERT ... SELECT Load from query INSERT INTO t SELECT ...

Code Snippet: Insert cohort

INSERT INTO high_cost_encounters
SELECT * FROM encounters
WHERE total_cost >= 1000;

UPDATE ... FROM

Use joins to update rows based on another table.

Reference Card: UPDATE FROM

Pattern Purpose Example
UPDATE ... FROM Join-based update UPDATE t SET ... FROM s WHERE ...

Code Snippet: Update with join

UPDATE encounters AS e
SET department = d.department
FROM departments AS d
WHERE e.department_id = d.department_id;