01: Defensive Programming and Debugging
Links & Self-Guided Review
- GitHub Education
- DS-217 Lecture 01
- Markdown Tutorial
- Shell Basics
- Exercism Python Basics
- GitHub Hello World
But First, A Blast from the Past
Carryovers from DataSci-217

- You already know Python, git/Markdown, and VS Code basics; this lecture focuses on reliability and debugging.
- Pick a workflow (local venv or Codespaces) and stick with it to reduce surprises.
Reference: DS-217 carryovers
| Topic | What to reuse |
|---|---|
| Python basics | Functions, imports, venv activation |
| Git hygiene | Small commits, meaningful messages |
| Markdown | Headings, fenced code blocks, links |
Code Snippet: Warmup commands
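A warmup session might look like this (directories are illustrative; `cd -` jumps back to the previous directory):

```shell
pwd        # where am I?
ls -la     # what is here (including hidden files)?
cd /tmp    # go somewhere
pwd
cd -       # ...and come back
```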
Command line quick hits
- Same commands everywhere: use the CLI for speed and reproducibility.
- Shell in Jupyter works too (`!ls`, `!pwd`), but keep paths relative.
Reference: Workflow commands
| Command | Purpose |
|---|---|
| `pwd` | Show current directory |
| `ls -la` | List files (long, hidden) |
| `cd <path>` | Change directory |
| `cp <src> <dst>` | Copy files |
| `mv <src> <dst>` | Move/rename |
| `rm <file>` | Remove file (careful) |
Code Snippet: Shell basics
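The copy/move/remove commands from the table can be practiced safely in a scratch directory (paths and filenames are illustrative):

```shell
# Practice copy/move/remove in a throwaway directory
mkdir -p /tmp/shell_basics_demo
cd /tmp/shell_basics_demo
echo "hello" > notes.txt
cp notes.txt notes_backup.txt    # copy
mv notes_backup.txt archive.txt  # move/rename
rm archive.txt                   # remove -- there is no undo!
ls -la
```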
Workflow: local venv or Codespaces

- Try the different approaches, then pick one workflow (local venv or Codespaces) and stick to it for fewer surprises.
- Local venv for performance/PHI; Codespaces for consistency and easy onboarding.
- Windows: WSL2 + `.venv/Scripts/activate` mirrors Linux/Codespaces.
- VS Code: Python + Jupyter extensions, format-on-save, debugger panel.
Warning
Beware of Symantec Firewall and WSL; they do NOT like each other.
Reference: Workflow setup
| Command | Purpose |
|---|---|
| `python -m venv .venv` | Create isolated environment |
| `source .venv/bin/activate` | Activate venv (Linux/macOS) |
| `.venv\Scripts\activate` | Activate venv (Windows) |
| `pip install -r requirements.txt` | Install course dependencies |
Code Snippet: venv + install
python -m venv .venv
source .venv/bin/activate # Windows: .venv\\Scripts\\activate
pip install -r requirements.txt
Notebook hygiene and reproducibility

- Run-all ready, deterministic, and no stray outputs or secrets.
- Clear outputs before commits unless the output is the point.
- Keep configs/paths in YAML or `.env` files.
Reference: Notebook hygiene
| Practice | Why it matters |
|---|---|
| Clear outputs | Prevent stale screenshots/results |
| Defined requirements | Reproducible environments |
| Relative paths | Portability across machines |
Code Snippet: Clear outputs
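One way to clear outputs from the command line is nbconvert (assumes Jupyter is installed; the notebook name is illustrative):

```shell
# Strip all cell outputs in place before committing
jupyter nbconvert --clear-output --inplace analysis.ipynb
```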

YAML essentials for config files
- YAML officially stands for "YAML Ain't Markup Language" (originally "Yet Another Markup Language"); think of it as "plain-English JSON" with indentation-based structure.
- Use spaces, not tabs, and keep indentation consistent (two spaces is plenty).
- Strings don't need quotes unless they contain special characters; `#` starts a comment.
- Great for centralizing run-time knobs like file paths, thresholds, or feature flags.
Reference Card: YAML building blocks
| Concept | YAML syntax example | Tip for beginners |
|---|---|---|
| Key/value | project: intake_audit | Keys end with : followed by a value |
| Nested structure | data:\n input_file: data/patients.csv | Indentation defines hierarchy |
| Lists | emails:\n - alice@example.com | Dashes introduce list items |
| Inline objects | height_cm: { min: 120, max: 230 } | Use cautiously; easier to read multiline |
Code Snippet: Sample config.yaml
data:
  input_file: "data/patient_intake.csv"
bounds:
  weight_kg:
    min: 30
    max: 250
  height_cm:
    min: 120
    max: 230
bmi_thresholds:
  underweight: 18.5
  normal: 25
  overweight: 30  # obese is anything above this
Code Snippet: Load YAML safely
from pathlib import Path
import yaml

CONFIG_PATH = Path("config.yaml")
with CONFIG_PATH.open() as f:
    config = yaml.safe_load(f)  # safe_load avoids executing arbitrary code
print("BMI thresholds:", config["bmi_thresholds"])
Config Alternative: .env
python-dotenv is a simple way to manage environment variables in Python. It lets you store sensitive information like API keys and database credentials in a `.env` file, which is not committed to version control.
Reference: .env
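A minimal `.env` file might look like this (placeholder values; add `.env` to `.gitignore` so real secrets never land in version control):

```
# .env -- never commit real secrets
API_KEY=replace-me
DB_PASSWORD=replace-me
LOG_LEVEL=INFO
```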
Code Snippet: .env
import os
import dotenv

# Load environment variables from .env file
dotenv.load_dotenv()
# dotenv.load_dotenv("filename.env")  # can also specify a different filename

API_KEY = os.getenv("API_KEY")
DB_PASSWORD = os.getenv("DB_PASSWORD")
Jupyter magics & shell in notebooks
- Magics speed up debugging and profiling; shell commands help inspect files without leaving the notebook.
Reference: Jupyter magics
| Magic | Purpose |
|---|---|
| `%pwd`, `%ls` | Where am I / list files |
| `%run script.py` | Run another script/notebook |
| `%timeit expr` | Quick timing |
| `%%bash` | Run a bash cell |
| `!ls data` | Shell command from a cell |
Code Snippet: Magics
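For example, in a notebook cell (these are IPython magics, so they only run inside Jupyter/IPython, not plain Python; the script name is illustrative):

```
%pwd                        # where am I
%run helper_script.py       # run another script
%timeit sum(range(1000))    # quick micro-benchmark
!ls data                    # shell command from a cell

%%bash
echo "this whole cell runs in bash"
```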
Git/GitHub/Markdown in 5 minutes

- Minimal loop: status → add → commit → push.
- Markdown: one `# Title`, structured headings, fenced code blocks.
- GUI (VS Code Source Control) is fine if it keeps you moving.
Reference: Git/Markdown cheatsheet
| Command | Purpose |
|---|---|
| `git status` | See staged/unstaged changes |
| `git add <path>` | Stage files |
| `git commit -m "feat: message"` | Save a snapshot |
| `git push` | Sync to GitHub |
| `git config user.email` | Set author email |

| Markdown | Purpose |
|---|---|
| `# Heading` | Section titles |
| `- bullet` | Lists |
| ```` ```lang ```` | Code fences with language |
| `[text](url)` | Links |
| `![alt](url)` | Images |
| `**bold**` | Bold text |
| `_italic_` | Italic text |
| `> quote` | Blockquotes |
Code Snippet: Git loop
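The minimal loop can be practiced in a scratch repository (paths, names, and the commit message are illustrative; `git push` needs a remote, so it is commented out):

```shell
mkdir -p /tmp/git_loop_demo && cd /tmp/git_loop_demo
git init -q .
git config user.email "you@example.com"  # one-time author setup
git config user.name "Your Name"
echo "# Demo" > README.md
git status                          # what changed?
git add README.md                   # stage
git commit -m "docs: add README"    # snapshot
# git push                          # sync to GitHub once a remote exists
```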

LIVE DEMO
Defensive programming for data science
Common failure modes in data projects

- Missing columns, unexpected units, unseeded randomness.
- Environment drift: different Python versions or stale venvs.
- PHI leaks via logs or screenshots.
Reference: Failure modes
| Risk | Quick defense |
|---|---|
| Missing columns | Assert expected columns |
| Unit drift | Normalize units + validate ranges |
| Stale env | Recreate venv from requirements.txt |
Code Snippet: Assert schema
def assert_expected_columns(df, expected):
    missing = [c for c in expected if c not in df]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
DRY + KISS (and pure functions)
These are important principles in software development, but they are not hard rules; they are guidelines that help you write better code. Helper functions that follow them are often collected in a shared lib/ directory.
- DRY: when you copy/paste the same logic, a bug fix becomes N bug fixes.
- KISS: fewer moving parts means fewer places for bugs to hide.
- Pure functions (same input → same output) are easier to test and debug.
Reference Card: DRY/KISS and pure functions
| Idea | Meaning | Why it matters |
|---|---|---|
| DRY | Don't repeat yourself | Fix bugs once; change code safely |
| KISS | Keep it simple (small functions, clear names) | Easier to read, debug, and refactor |
| Pure function | No hidden changes; returns a value instead of "doing" | Simple unit tests; fewer surprise effects |
Code Snippet: A tiny "pure helper" function
# Define a function once and use it everywhere!
def normalize_column_name(name: str) -> str:
    return name.strip().lower().replace(" ", "_")
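As a quick check, the helper can normalize a whole list of messy column names in one pass (the function is repeated here so the snippet stands alone; the column names are illustrative):

```python
def normalize_column_name(name: str) -> str:
    return name.strip().lower().replace(" ", "_")

# Messy headers straight from a spreadsheet
raw_columns = [" Patient ID", "Height CM", "weight kg "]
print([normalize_column_name(c) for c in raw_columns])
# ['patient_id', 'height_cm', 'weight_kg']
```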
Linters (catch issues early)

- Linters flag common mistakes before you run anything.
- They catch typos, unused imports, and inconsistent style.
- In VS Code, they show up as warnings while you type.
Reference Card: Linters in practice
| Tool | What it catches early | Where you see it |
|---|---|---|
| Linter | Unused imports, typos, style issues | VS Code "Problems" panel, squiggles, CI |
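As an illustration, this snippet runs fine but a linter such as ruff would flag two issues immediately (F401 and E741 are the pyflakes/pycodestyle rule codes it reports):

```python
import os  # F401: `os` imported but unused -- a linter flags this

def mean(values):
    l = len(values)  # E741: `l` is an ambiguous variable name
    return sum(values) / l

print(mean([1, 2, 3]))  # prints 2.0; correct code can still be lint-dirty
```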
Config files (stop hardcoding)
- Config files keep settings out of your code.
- Typical config values: file paths, URLs, API keys (never commit real secrets).
- Should be human readable and easy to edit.
- `.env`: compatible with shell scripts and environment variables.
- `.yaml`: very human-friendly.
- `.json`: most common with JavaScript and web applications.
Reference Card: What belongs in config vs code
| Put it in… | Examples |
|---|---|
| Config | data paths, environment names, API endpoints |
| Code | data cleaning logic, feature engineering, models |
Code Snippet: Config helper
import pandas as pd
import yaml
from pathlib import Path

def load_settings(config_path: Path) -> dict:
    return yaml.safe_load(config_path.read_text())

settings = load_settings(Path("config.yaml"))
data = pd.read_csv(settings["data"]["path"])
# or, for parquet inputs:
# data = pd.read_parquet(settings["data"]["path"])
Code quality tools
- Formatters and linters (`ruff`) keep code clean; tests catch regressions.
- Run them before committing, or wire them into a pre-commit hook.
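One common way to wire this into a pre-commit hook is a `.pre-commit-config.yaml` using the astral-sh/ruff-pre-commit hooks (a sketch; pin `rev` to a real released tag):

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.0   # pin to a released tag
    hooks:
      - id: ruff          # lint
      - id: ruff-format   # format
```

After `pip install pre-commit` and `pre-commit install`, the hooks run on every `git commit`.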
Reference: Quality tools
| Tool | Purpose |
|---|---|
| `ruff` | Lint/format fast |
| `black` | Consistent formatting |
Code Snippet: Lint/format/test
$ uv run ruff check # or just `ruff check` if installed globally
src/numbers/calculate.py:3:8: F401 [*] `os` imported but unused
Found 1 error.
[*] 1 fixable with the `--fix` option.
# Input
def _make_ssl_transport(
        rawsock, protocol, sslcontext, waiter=None,
        *, server_side=False, server_hostname=None,
        extra=None, server=None,
        ssl_handshake_timeout=None,
        call_connection_made=True):
    '''Make an SSL transport.'''
    if waiter is None:
        waiter = Future(loop=loop)
    if extra is None:
        extra = {}
    ...
$ ruff format example.py
# Ruff
def _make_ssl_transport(
    rawsock,
    protocol,
    sslcontext,
    waiter=None,
    *,
    server_side=False,
    server_hostname=None,
    extra=None,
    server=None,
    ssl_handshake_timeout=None,
    call_connection_made=True,
):
    """Make an SSL transport."""
    if waiter is None:
        waiter = Future(loop=loop)
    if extra is None:
        extra = {}
Automated Testing: pytest
This is what we use to automate grading. The tests are run automatically on every commit, but they are not magical. They are only as good as the person who writes them.
- Tests should be deterministic and cover edge cases.
- Use fixtures (known input/output pairs) for setup/teardown.
- Tests pass or fail deterministically.
- `pytest` will run all files of the form `test_*.py` or `*_test.py`.
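The fixture idea can be made concrete with a tiny example (a sketch; the function and names are illustrative, and running it under pytest requires `pytest` installed):

```python
import pytest

def clean(record: dict) -> dict:
    # toy function under test: trim whitespace from string values
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

@pytest.fixture
def sample_record():
    # a known input: setup runs before each test that requests this fixture
    return {"patient_id": " p001 ", "age": 42}

def test_clean_strips_whitespace(sample_record):
    # known output: whitespace removed, non-string values untouched
    assert clean(sample_record) == {"patient_id": "p001", "age": 42}
```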
Reference: pytest outcomes
| Behavior | Result | Example |
|---|---|---|
| `assert <condition>` (True) | ✅ PASS | `assert 1 + 1 == 2` |
| `assert <condition>` (False) | ❌ FAIL | `assert 1 + 1 == 3` |
| `raise Exception(...)` | ❌ FAIL | `raise ValueError("invalid")` |
| Raise expected exception (caught) | ✅ PASS | `with pytest.raises(ValueError):` |
| Return normally (no assert/raise) | ✅ PASS | Empty test body passes |
| Timeout or infinite loop | ❌ FAIL | Exceeds max duration |
| Import error or syntax error | ⚠️ ERROR | `import bad_module` |
Code Snippet: pytest
import random

def test_example():
    # randomly fail (not a good test)
    if random.random() < 0.5:
        raise ValueError("Random failure")
    assert True

Raising exceptions
Exceptions represent errors or unusual conditions during program execution. When Python encounters a problem (missing file, bad type, invalid value), it raises an exception; if not caught, the program stops with an error message.
- Raise exceptions; avoid bare `except`.
- Let exceptions bubble up unless you can recover.
- Catch only what you can handle; re-raise if unsure.
Reference: Exception types
| Exception | When to raise | Avoid |
|---|---|---|
| `ValueError` | Invalid argument value | `except Exception:` |
| `FileNotFoundError` | Missing file | Bare `except:` |
| `KeyError` | Missing dict key | Generic `Exception()` |
| `TypeError` | Wrong type | Swallowing errors silently |
Code Snippet: Specific exceptions
from pathlib import Path

def load_data(path: str) -> list[str]:
    csv_path = Path(path)
    if not csv_path.exists():
        raise FileNotFoundError(f"Missing input: {csv_path}")
    if csv_path.suffix != ".csv":
        raise ValueError(f"Expected .csv, got {csv_path.suffix}")
    return csv_path.read_text().splitlines()
Catching and handling exceptions
Sometimes you can recover from an error (e.g., missing file → use default). In those cases, catch the exception and handle it gracefully.
- Use `try/except` to catch exceptions you can recover from.
- Be specific: catch the exact exception type, not `Exception` or bare `except`.
- Use `finally` for cleanup (file closes, locks release) regardless of success or failure.
Reference: try/except/finally patterns
| Pattern | Use when | Example |
|---|---|---|
| `try/except` | You can handle the error | File not found → create default |
| `try/finally` | You need cleanup | Always close file handles |
| `try/except/finally` | Both: recover AND clean up | Read file, handle error, close handle |
| Re-raise with `raise` | You caught it but can't fix it | Log error, then re-raise |
Code Snippet: try/except/finally
import logging
from pathlib import Path

def read_patient_data(path: str) -> list[str]:
    csv_path = Path(path)
    file_handle = None
    try:
        if not csv_path.exists():
            raise FileNotFoundError(f"Missing {csv_path}")
        file_handle = open(csv_path)
        return [line.strip() for line in file_handle]
    except FileNotFoundError as e:
        logging.warning("File not found; using empty default: %s", e)
        return []  # Recover gracefully
    except ValueError as e:
        logging.error("Bad data: %s", e)
        raise  # Can't recover; let caller handle it
    finally:
        if file_handle:
            file_handle.close()  # Always runs
Logging
There is more to effective logging than just sprinkling `print()` statements everywhere. Use the built-in `logging` module to log at different levels (DEBUG, INFO, WARNING, ERROR).
- Log at the right level; no PHI in logs.
- Use structured messages with context.
- Configure logging format and level at the start of your program.
- Remember: logs are the first place to look when something breaks.
Reference: Logging levels
| Level | Use for | Example |
|---|---|---|
| DEBUG | Detailed diagnostic info | logging.debug("var=%s", var) |
| INFO | High-level progress | logging.info("Starting ETL") |
| WARNING | Non-blocking issues | logging.warning("Missing value at row %d", i) |
| ERROR | Failures needing attention | logging.error("Failed to parse") |
Code Snippet: Logging with checks
import logging
import os
from pathlib import Path

# Configure logging at the start of your program
log_level = os.getenv("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, log_level), format="%(levelname)s:%(message)s")

# Can also log to a file; here we append to 'app.log' instead of overwriting.
# Note: basicConfig only takes effect the first time it is called.
# logging.basicConfig(filename="app.log", filemode="a", level=logging.INFO)

def load_clean_data(path: str) -> list[str]:
    csv_path = Path(path)
    if not csv_path.exists():
        raise FileNotFoundError(f"Missing input: {csv_path}")
    logging.info("Reading %s", csv_path)
    return csv_path.read_text().splitlines()
Fail fast with actionable messages
- Detect problems early in the pipeline.
- Include context in error messages (values, paths, hints).
- Stop gracefully instead of silently producing wrong results.
Reference: Actionable messages
| Bad | Better | Why |
|---|---|---|
| `raise ValueError("error")` | `raise ValueError(f"Expected columns {expected}, got {df.columns}")` | Includes what was expected vs actual |
| `assert len(df) > 0` | `if len(df) == 0: raise ValueError("Empty input file")` | Name the failure, don't just assert |
| Silent pass | Raise or log the issue | Prevents hours of debugging downstream |
Code Snippet: Guard clauses
import logging

def process_patients(df):
    if df.empty:
        raise ValueError("Empty DataFrame: no patient records to process")
    if "patient_id" not in df.columns:
        raise ValueError(f"Missing 'patient_id'; got columns: {list(df.columns)}")
    if (df["age"] < 0).any():
        raise ValueError(f"Negative ages found: {df[df['age'] < 0].index.tolist()}")
    logging.info("Processing %d patients", len(df))
    return df  # Safe to continue
LIVE DEMO
Debugging in VS Code + Jupyter

Debugging toolkit overview
- Start simple with prints/logging; move to pdb/VS Code for deeper inspection.
- Breakpoints + Variables/Watch/Debug Console = see state without littering prints.
Reference: Debugging toolkit
| Tool | Use case |
|---|---|
| `print(f"{var=}")` | Quick value checks |
| `breakpoint()` | Drop into pdb |
| VS Code debugger | Visual stepping/inspection |
Code Snippet: Print + calc
def calculate_bmi(weight_kg, height_m):
    print(f"{weight_kg=}, {height_m=}")
    bmi = weight_kg / (height_m ** 2)
    print(f"{bmi=}")
    return bmi
Print debugging: start here

- Use f-strings with `{var=}` to see names + values.
- Remove prints before commit, or migrate to logging.
Reference: Print patterns
| Pattern | Purpose |
|---|---|
| `print(f"{df.shape=}")` | Check dimensions |
| `print(f"{row=}")` | Inspect loop state |
| `print(f"{result=}")` | Verify outputs |
Code Snippet: Print debugging
def calculate_bmi(weight_kg, height_m):
    print(f"{weight_kg=}, {height_m=}")
    return weight_kg / (height_m ** 2)
VS Code debugger
- `pdb` or `ipdb` for the terminal; VS Code for visuals and conditional breakpoints.
- Break on exception with `breakpoint()` inside `except`.
Reference: Breakpoints & commands
| Tool/command | Purpose |
|---|---|
| `n` / `s` / `c` | Next, step into, continue |
| `p var` | Print variable |
| Conditional BP | Pause when expression true |
| Logpoint | Print a message without stopping |
| `breakpoint()` | Drop into pdb on exception |
| VS Code gutter | Opens debugger, may be conditional |
Code Snippet: Conditional + logpoint
try:
    risky_fn()
except Exception:
    breakpoint()  # pdb session

# VS Code logpoint: right-click breakpoint -> "Add Logpoint"
# Message example: "value={value}"
Runtime variable inspection in VS Code
- Variables panel shows locals/globals; expand DataFrames.
- Watch expressions track custom values.
- Debug Console evaluates code while paused.

Reference: VS Code panels
| Panel | Purpose |
|---|---|
| Variables | Inspect state at breakpoint |
| Watch | Track expressions (df.shape) |
| Debug Console | Run ad-hoc checks (df.head()) |
Code Snippet: Inspect while paused
# Pause at breakpoint, then:
# - Check Variables panel
# - Add Watch: df.shape
# - Debug Console: df.dtypes
VS Code debugger basics (scripts)

- Click the gutter to set breakpoints (red dot), then use Run and Debug (F5) or the play button to start the Python debugger.
- In Run and Debug, pick the Python config or accept the default; make sure the correct interpreter/venv is selected.
- Inspect call stack, Variables, and Watch panels while stepping; launch.json is optional because the Python extension supplies defaults.
- The debugger uses the Python interpreter currently selected in VS Code; override it in launch.json if needed.
Reference: launch.json fields
| Field | Meaning |
|---|---|
| `program` | Entry script |
| `request` | launch vs attach |
| `type` | python |
Code Snippet: launch config
Example override to debug a specific script with args, env vars, and a specific venv:
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug Buggy BMI Script",
      "type": "python",
      "request": "launch",
      "program": "${workspaceFolder}/lectures/01/demo/03a_buggy_bmi.py",
      "python": "${workspaceFolder}/.venv/bin/python",
      "args": ["runserver", "--noreload", "0.0.0.0:8001"],
      "env": {
        "LOG_LEVEL": "DEBUG"
      },
      "console": "integratedTerminal"
    }
  ]
}
Debugging Notebooks in VS Code

- Click the debug icon on the cell; set breakpoints inside.
- Restart kernel before Run All after debugging.

Reference: Notebook debugging steps
| Step | Purpose |
|---|---|
| Debug cell button | Start a notebook debug session |
| Breakpoints in cell | Pause where needed |
| Restart kernel | Clear state after debugging |
Debugging checklist for messy data

- Reproduce with the smallest failing fixture.
- Check assumptions (types, units, nulls) before changing code.
- Add assertions/logging near the failure and rerun.
- Write a test to prevent regression.
Reference: Debugging checklist
| Step | Goal |
|---|---|
| Reproduce | Confirm the failure |
| Minimize | Small fixture for fast loops |
| Guard | Assertions/logging close to bug |
| Test | Lock in the fix with pytest |
Tests to lock in fixes
- Save failing fixtures and add tiny tests so bugs stay fixed.
- Prefer small, deterministic inputs; avoid brittle expectations.
Code Snippet: Column validation helper
import pandas as pd

def has_required_columns(df: pd.DataFrame, required: list[str]) -> bool:
    """Check if DataFrame contains all required columns."""
    # Reminder: all() returns True only if all elements are True
    return all(col in df.columns for col in required)

# Use the helper to check before processing
required = ["patient_id", "height_cm", "weight_kg", "age", "sex"]

# Raise an error if columns are missing
if not has_required_columns(df, required):
    # Find which columns are missing for the error message
    missing = [col for col in required if col not in df.columns]
    raise ValueError(f"Missing required columns: {missing}")
Code Snippet: Test column validation
import pandas as pd
from pathlib import Path

def test_detects_missing_height_column():
    """Ensure column check catches missing height_cm in fixture."""
    from your_module import has_required_columns

    fixture = Path("lectures/01/demo/data/patient_intake_missing_height.csv")
    df = pd.read_csv(fixture)
    required = ["patient_id", "height_cm", "weight_kg", "age", "sex"]
    # Should return False because height_cm is missing
    assert has_required_columns(df, required) is False
Rubber ducking
Rubber ducking is still undefeated for finding your own bugs.
