Classification: Putting a Label on Things

Crash Course in Classification

XKCD classification

The building blocks of ML are algorithms for regression and classification.

Classification Methods

Classification algorithms learn to assign labels to data points based on their features. In health data, this might mean predicting whether a patient has a disease based on lab results.
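To make that concrete, here is a minimal sketch of the whole idea. The feature names and lab values are made up purely for illustration:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical lab results per patient: [glucose, white blood cell count]
X = [[90, 5.0], [180, 11.0], [95, 6.0], [200, 12.5]]
y = [0, 1, 0, 1]  # 0 = no disease, 1 = disease

model = LogisticRegression().fit(X, y)  # learn labels from features
print(model.predict([[100, 5.5]]))      # assign a label to a new patient
```

The rest of this section unpacks the pieces: which model to fit, and how to judge the labels it assigns.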

Reference Card: Common Classification Methods

Method Description Strengths
Logistic Regression Linear model for binary outcomes Easy to interpret, fast
Decision Trees Tree structure, splits data by feature values Intuitive but can overfit
Random Forest Many decision trees combined More robust, less overfitting
Support Vector Machines (SVM) Finds the best boundary between classes Good for complex data
Naive Bayes Probabilistic, assumes features are independent Fast and simple
Neural Networks Layers of nodes, can model complex patterns Powerful but less interpretable

it's all statistics

Model Evaluation

There are many more classification approaches than data scientists, so choosing the best one for your application can be daunting. Thankfully, all of them output predicted classes for each data point. We can use this similarity to define objective performance criteria based on how often the predicted class matches the underlying truth.
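The simplest such criterion is accuracy: the fraction of predictions that match the truth. A quick sketch with sklearn's accuracy_score:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]  # underlying truth
y_pred = [1, 0, 0, 1, 1]  # predicted classes

print(accuracy_score(y_true, y_pred))  # 3 of 5 match: 0.6
```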

I get in trouble with the data science police if I don’t include something about confusion matrices:

Evaluation metrics

Reference Card: confusion_matrix

Component Details
Function sklearn.metrics.confusion_matrix()
Purpose Compute confusion matrix to evaluate classification accuracy
Key Parameters y_true: Ground truth target values
y_pred: Estimated targets from classifier
labels: List of labels to index the matrix
normalize: Normalize over ‘true’, ‘pred’, ‘all’, or None
Returns Array where rows are actual, columns are predicted

Code Snippet: Confusion Matrix

from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0]  # Actual labels
y_pred = [1, 0, 0, 1, 1]  # Predicted labels
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1]
#  [1 2]]
# Interpretation (for binary case):
# [[TN, FP],
#  [FN, TP]]

Classification Report

The classification_report function provides a comprehensive summary of precision, recall, and F1-score for each class in a single call—essential for evaluating multi-class models.

Reference Card: classification_report

Component Details
Function sklearn.metrics.classification_report()
Purpose Build a text report showing main classification metrics per class
Key Parameters y_true: Ground truth target values
y_pred: Estimated targets from classifier
target_names: Display names for classes
output_dict: Return dict instead of string
Returns String (or dict) with precision, recall, F1-score, support per class

Code Snippet: Classification Report

from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 0, 2, 2]

print(classification_report(y_true, y_pred, target_names=['No Disease', 'Mild', 'Severe']))
#               precision    recall  f1-score   support
# 
#   No Disease       0.50      0.50      0.50         2
#         Mild       0.67      0.67      0.67         3
#       Severe       1.00      1.00      1.00         2
# 
#     accuracy                           0.71         7
#    macro avg       0.72      0.72      0.72         7
# weighted avg       0.71      0.71      0.71         7

ROC Curve and AUC

An ROC curve (Receiver Operating Characteristic curve) shows how a classifier’s performance changes as you vary the threshold for predicting “positive.” It plots the true positive rate (sensitivity) on the y-axis against the false positive rate (1 − specificity) on the x-axis, one point per threshold.

AUROC example

Reference Card: roc_curve and roc_auc_score

Component Details
roc_curve() Returns FPR, TPR, and thresholds for plotting
roc_auc_score() Returns the Area Under the ROC Curve
Key Parameters y_true: True binary labels
y_score: Probability estimates (use predict_proba()[:, 1])
Interpretation AUC = 0.5 is random guessing; AUC = 1.0 is perfect

Code Snippet: ROC and AUC

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]  # Model probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc_score = roc_auc_score(y_true, y_scores)
print(f"AUC Score: {auc_score}")  # Output: AUC Score: 0.75
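To actually draw the curve, plot the FPR/TPR arrays returned by roc_curve. A plotting sketch with matplotlib (the Agg backend and output filename are incidental choices):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, _ = roc_curve(y_true, y_scores)
plt.plot(fpr, tpr, marker='o', label=f"AUC = {roc_auc_score(y_true, y_scores):.2f}")
plt.plot([0, 1], [0, 1], linestyle='--', label="Random guessing")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.savefig("roc_curve.png")
```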

AUC is desirable for two reasons: it is scale-invariant (it measures how well predictions are ranked, not their absolute values) and threshold-invariant (it summarizes performance across all possible classification cutoffs).

However, both these reasons come with caveats: ranking quality says nothing about whether the predicted probabilities are well calibrated, and a threshold-free summary can hide poor performance at the specific threshold you actually deploy.

Supervised vs. Unsupervised

There are two(-ish) overarching categories of classification algorithms: supervised and unsupervised. There are many possible approaches in each category, and some that work well in both (deep learning, for example).

Supervised Models

To fairly evaluate each model, we must test its performance on data it was not trained on. So we split our dataset into two partitions, train and test:

A note on terminology: The standard convention is train/validation/test, where validation is used for model development and test is the final holdout. However, sklearn’s function is called train_test_split() and cross-validation literature often refers to “test folds”—so you’ll see some inconsistency in practice. We’ll follow the standard: validation for tuning/selection, test for final evaluation.

Reference Card: train_test_split

Component Details
Function sklearn.model_selection.train_test_split()
Purpose Split arrays into random train and test subsets
Key Parameters test_size: Proportion for test (default 0.25)
random_state: Seed for reproducibility
stratify: Preserve class ratios in splits (pass y for classification)
Returns X_train, X_test, y_train, y_test

Tip: Always use stratify=y for classification tasks—especially with imbalanced classes—to ensure both train and test sets have similar class distributions.

Code Snippet: Train/Test Split

from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7]])
y = np.array([0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

Cross-Validation for Model Comparison

A single train/test split can be misleading if the split happens to be particularly easy or hard. Cross-validation provides more robust performance estimates by training and testing on multiple different splits of the data.

K-Fold Cross-Validation:

  1. Split data into k equal parts (folds)
  2. Train on k-1 folds, test on the remaining fold
  3. Repeat k times, each fold serving as test once
  4. Average the results for final performance estimate
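The four steps above can be sketched by hand with KFold, using synthetic data from make_classification (StratifiedKFold, covered below, is usually the better choice for classification):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=60, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # step 1: k folds

fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])  # step 2: train on k-1 folds
    preds = model.predict(X[test_idx])                            # ...test on the held-out fold
    fold_scores.append(accuracy_score(y[test_idx], preds))        # step 3: repeat k times

print(np.mean(fold_scores))  # step 4: average for the final estimate
```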

Reference Card: cross_val_score

Component Details
Function sklearn.model_selection.cross_val_score()
Purpose Evaluate model with cross-validation, returning scores for each fold
Key Parameters estimator: Model to evaluate
X, y: Features and labels
cv: Number of folds or CV splitter
scoring: Metric to use (‘accuracy’, ‘f1’, ‘roc_auc’)
Returns Array of scores, one per fold

Code Snippet: Cross-Validation

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X = np.random.randn(100, 5)
y = np.random.randint(0, 2, 100)

model = RandomForestClassifier(n_estimators=50, random_state=42)

# Simple 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"F1 scores per fold: {scores}")
print(f"Mean F1: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

Reference Card: StratifiedKFold

Component Details
Function sklearn.model_selection.StratifiedKFold()
Purpose K-fold iterator that preserves class distribution in each fold
Key Parameters n_splits: Number of folds (default 5)
shuffle: Shuffle data before splitting
random_state: Seed for reproducibility
Use With Pass to cross_val_score(cv=...) or iterate manually

Code Snippet: StratifiedKFold

from sklearn.model_selection import StratifiedKFold

# Create a stratified k-fold splitter
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Iterate over folds manually (useful for custom training loops)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_train_fold, X_val_fold = X[train_idx], X[val_idx]
    y_train_fold, y_val_fold = y[train_idx], y[val_idx]
    print(f"Fold {fold}: train={len(train_idx)}, val={len(val_idx)}")

Model Comparison Workflow

Once you have evaluation metrics and cross-validation set up, the typical workflow for comparing models is:

  1. Define candidate models - pick a small number (2-4) of algorithms appropriate for your data
  2. Compare on the same metric - F1, AUC, or whatever matters for your problem. Choose clear criteria for establishing a winner ahead of time: ideally a single “champion” metric, but possibly a more complex ruleset (e.g., best on metric A with metric B above a minimum)
  3. Use the same data splits - every model is trained and evaluated on identical folds, making the comparison fair
  4. For each fold (split n of k):
    1. (optional) Hyperparameter tuning - vary model settings (e.g., tree depth)
    2. Train each model on identical training sets
    3. Evaluate each model on identical validation sets
    4. Record performance metrics for this fold
  5. Compare models and select best - usually best mean score for champion metric, but watch out for models with high variance across folds
  6. Retrain selected model on full train+validation data
  7. Final evaluation - evaluate the chosen model on the held-out test set (touched only once)

flowchart TB
    A[Setup: Define models, metric, hold out test, split into k folds]

    A --> F1 & F2 & Fdots & Fk

    subgraph FoldRow[" "]
        direction LR
        F1[Fold 1]
        F2[Fold 2]
        Fdots[...]
        Fk[Fold k]
    end

    F1 --> LR & RF & XGB

    subgraph Fold1Box[" "]
        LR[Logistic Regression] --> LR1[Train] --> LR2[Eval] --> LR3[Record]
        RF[Random Forest] --> RF1[Train] --> RF2[Eval] --> RF3[Record]
        XGB[XGBoost] --> XGB1[Train] --> XGB2[Eval] --> XGB3[Record]
    end

    F2 -.-> R2[...]
    Fdots -.-> Rdots[...]
    Fk -.-> Rk[...]

    LR3 & RF3 & XGB3 --> Avg[Average scores per model]
    R2 & Rdots & Rk -.-> Avg
    Avg --> Select[Select best model]
    Select --> Retrain[Retrain best on ALL train+val]
    Retrain --> Final[Evaluate on test set]

    style FoldRow fill:none,stroke:none

Code Snippet: Comparing Multiple Models

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'XGBoost': XGBClassifier(n_estimators=100, verbosity=0)
}

# Use same CV splitter for fair comparison
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Compare models
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring='f1')
    print(f"{name}: F1 = {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
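The comparison loop above covers steps 1-5. Steps 6-7 (retrain the winner, then touch the test set exactly once) might look like this sketch, with synthetic data and Random Forest standing in as the hypothetical winner:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=42)

# Hold out the final test set before any model comparison
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ... cross-validate candidates on X_trainval and pick a winner ...

# Step 6: retrain the selected model on ALL train+validation data
best_model = RandomForestClassifier(n_estimators=100, random_state=42)
best_model.fit(X_trainval, y_trainval)

# Step 7: final evaluation on the untouched test set
print(f"Test F1: {f1_score(y_test, best_model.predict(X_test)):.3f}")
```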

LIVE DEMO

Quick Supervised Model Overview

Let’s look at a few tools that you should get a lot of use out of:

Logistic Regression

Logistic regression works similarly to linear regression, but passes the linear combination of features through a sigmoid function that squeezes our straight line into an S-curve bounded between 0 and 1. Note that logistic regression still “fits” linear terms; they just sit inside a more complex function.

Linear vs logistic regression

Additionally, it uses log loss (also called cross-entropy loss: a cost that measures how far predicted probabilities are from the true labels and penalizes confident wrong predictions more than uncertain ones) in place of our usual mean-squared error cost function. Log loss gives a convex loss surface, so gradient descent can reliably find the optimal weights.
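A quick numeric sketch of that penalty structure, using sklearn's log_loss (the labels= argument is needed here only because the toy y_true contains a single class):

```python
from sklearn.metrics import log_loss

y_true = [1, 1]  # both patients truly positive
# Predicted probabilities for [class 0, class 1]
confident_right = [[0.1, 0.9], [0.1, 0.9]]
uncertain       = [[0.5, 0.5], [0.5, 0.5]]
confident_wrong = [[0.9, 0.1], [0.9, 0.1]]

for name, probs in [("confident right", confident_right),
                    ("uncertain", uncertain),
                    ("confident wrong", confident_wrong)]:
    print(name, round(log_loss(y_true, probs, labels=[0, 1]), 3))
# Confident wrong predictions cost far more than merely uncertain ones
```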

Approximation optimization

Reference Card: LogisticRegression

Component Details
Function sklearn.linear_model.LogisticRegression()
Purpose Linear model for classification (binary or multinomial)
Key Parameters penalty: Regularization (‘l2’, ‘l1’, ‘elasticnet’)
C: Inverse regularization strength (smaller = stronger)
solver: Optimization algorithm
max_iter: Max iterations for convergence
Key Methods .fit(), .predict(), .predict_proba()

Code Snippet: Logistic Regression

from sklearn.linear_model import LogisticRegression
X = [[1, 2], [2, 3], [3, 4]]
y = [0, 1, 0]
model = LogisticRegression().fit(X, y)
print(model.predict([[2, 2]]))  # Predicts class label

Random Forest

A random forest is an ensemble of decision trees, each trained on a random sample of the data. Each of the steps can be tweaked, but the general flow goes:

  1. Bagging (Bootstrap AGGregating) - create k random samples (with replacement) from the dataset
  2. Grow trees - individual decision trees are constructed by choosing the best features and cutpoints to separate the classes
  3. Classify - instances are run through all trees and assigned a class by majority vote
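The three steps can be sketched by hand. Note that real random forests also randomize the features considered at each split, which this toy version skips:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))                  # 1. bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # 2. grow a tree on it

votes = np.array([tree.predict(X[:5]) for tree in trees])  # 3. run instances through all trees...
majority = (votes.mean(axis=0) > 0.5).astype(int)          # ...and take the majority vote
print(majority)
```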

Reference Card: RandomForestClassifier

Component Details
Function sklearn.ensemble.RandomForestClassifier()
Purpose Ensemble of decision trees for classification
Key Parameters n_estimators: Number of trees (default 100)
max_depth: Maximum tree depth
min_samples_split: Min samples to split a node
max_features: Features to consider per split
Attributes .feature_importances_ - importance of each feature

Code Snippet: Random Forest

from sklearn.ensemble import RandomForestClassifier

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 1, 0, 1]
model = RandomForestClassifier(n_estimators=10).fit(X, y)
print(model.predict([[2, 2]]))

XKCD Ensemble Model

XGBoost

XGBoost stands for Extreme Gradient Boosting. Like other tree algorithms, XGBoost considers each instance with a series of if statements, resulting in a leaf with associated class assignment scores. Where XGBoost differs is that it uses gradient boosting to focus on weak-performing areas of the previous tree.

XGBoost diagram

Reference Card: XGBClassifier

Component Details
Function xgboost.XGBClassifier()
Purpose Gradient boosting implementation for classification
Key Parameters n_estimators: Number of boosting rounds
learning_rate: Step size shrinkage (eta)
max_depth: Maximum tree depth
subsample: Training instance subsample ratio
Install pip install xgboost

Code Snippet: XGBoost

import xgboost as xgb

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 1, 0, 1]
model = xgb.XGBClassifier(n_estimators=10).fit(X, y)
print(model.predict([[2, 2]]))

LightGBM: A Faster Alternative

LightGBM (Light Gradient Boosting Machine) is Microsoft’s gradient boosting framework, optimized for speed and memory efficiency. It’s often faster than XGBoost, especially on large datasets.

Key differences from XGBoost:

  1. Grows trees leaf-wise (best-first) rather than level-wise, often reaching lower loss with fewer splits
  2. Uses histogram-based binning of continuous features, reducing training time and memory use
  3. Handles categorical features natively, without one-hot encoding

Reference Card: LGBMClassifier

Component Details
Function lightgbm.LGBMClassifier()
Purpose Fast gradient boosting for classification
Key Parameters n_estimators: Number of boosting rounds
learning_rate: Step size shrinkage
num_leaves: Max number of leaves per tree
max_depth: Limit tree depth (-1 for no limit)
Install pip install lightgbm

Code Snippet: LightGBM

from lightgbm import LGBMClassifier

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 1, 0, 1]
model = LGBMClassifier(n_estimators=10, verbosity=-1).fit(X, y)
print(model.predict([[2, 2]]))

When to use which? As a rule of thumb: start with logistic regression for an interpretable baseline, move to random forest when relationships look nonlinear, and reach for XGBoost or LightGBM when you need maximum accuracy on tabular data (preferring LightGBM when datasets are large and training speed matters).

Deep Learning

Deep learning is a subfield of machine learning that uses artificial neural networks with multiple layers to learn complex patterns from data. These models use back-propagation to adjust the weights in each layer during training, allowing them to model very large and complex datasets.

Deep learning models are especially useful for handling large datasets with high dimensionality, and they can be used for both supervised and unsupervised learning tasks. However, they often require a large amount of data and computation power to train effectively.

More on neural networks later in the course…

Reference Card: Keras Sequential Model

Component Details
Sequential() Linear stack of neural network layers
Dense(units, activation) Fully connected layer
model.compile() Configure for training (optimizer, loss, metrics)
Common Activations ‘relu’, ‘sigmoid’, ‘softmax’, ‘tanh’

Code Snippet: Simple Neural Network

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(8, activation='relu', input_shape=(2,)),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X_train, y_train, epochs=10)  # X_train, y_train from an earlier split

Unsupervised Models (Further Reading)

Unsupervised models are used when you don’t have labeled data. While this course focuses on supervised classification, it’s worth knowing what’s available:

unsupervised
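As a taste of the unsupervised side, here is a minimal k-means sketch: it groups points into clusters with no labels at all (the two synthetic blobs are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs of points, with no labels attached
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # cluster IDs, learned without y
```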

LIVE DEMO

How Models Fail

Even the best models can fail if the data is messy, the problem is hard, or the world changes. Common failure modes include labeling problems, poor fit (under- or over-fitting), dataset shift, confounding (Simpson’s paradox), and troublesome classes:

Labeling

Oh, labeling…

Labeling error

Labeling issues arise when the data is not labeled correctly or consistently, which can lead to biased or inaccurate models. Examples include inconsistent annotators, ambiguous category definitions, data entry errors, and label definitions that drift over time.

Fit

A model may fail to fit the data in one of two ways: under-fitting or over-fitting.

Overfitting

XKCD Curve Fitting

Dataset Shift

Dataset shift occurs when the distribution of the data changes between the training and test sets. Dataset shift can be divided into three types:

  1. Covariate Shift: A change in the distribution of the independent variables between the training and test sets.
  2. Prior Probability Shift: A change in the distribution of the target variable between the training and test sets.
  3. Concept Shift: A change in the relationship between the independent and target variables between the training and test sets.
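Covariate shift, at least, can often be detected by comparing a feature's training distribution against production data. One common sketch uses a two-sample Kolmogorov-Smirnov test from scipy (the age numbers here are made up):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_ages = rng.normal(50, 10, 1000)  # ages seen during training
prod_ages = rng.normal(60, 10, 1000)   # production population has aged

stat, p_value = ks_2samp(train_ages, prod_ages)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.2e}")
# A tiny p-value flags that the two distributions differ
```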

Training data (Q2 2017): “Fidget spinners are the future!”

Fidget spinner hype

Production data (2018+): “…nevermind”

Fidget spinner crash

Simpson’s Paradox

Simpson’s paradox occurs when a trend appears in several different groups of data, but disappears or reverses when these groups are combined. It is a common problem in statistics and machine learning that can occur when there are confounding variables that affect the relationship between the independent and dependent variables.

Simpson's paradox
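The reversal is easy to reproduce numerically. In this synthetic sketch, y rises with x inside each group, yet falls with x once the groups are pooled:

```python
import numpy as np

rng = np.random.default_rng(42)
# Within each group, y increases with x (slope +1)
x1 = rng.uniform(0, 4, 200)
y1 = x1 + 8 + rng.normal(0, 0.3, 200)  # group 1: low x, high baseline
x2 = rng.uniform(6, 10, 200)
y2 = x2 + rng.normal(0, 0.3, 200)      # group 2: high x, low baseline

x, y = np.concatenate([x1, x2]), np.concatenate([y1, y2])
print(np.corrcoef(x1, y1)[0, 1])  # strongly positive
print(np.corrcoef(x2, y2)[0, 1])  # strongly positive
print(np.corrcoef(x, y)[0, 1])    # negative once pooled
```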

Troublesome Classes

Certain classes or categories in a dataset may be more difficult to classify accurately than others. This can be due to imbalanced class distribution, noisy data, or other factors. Identifying and addressing troublesome classes is an important step in building effective classification models.

Confused classes

XKCD Extrapolating

Model Interpretation

Understanding why a model makes its predictions is crucial in health data science—especially when decisions impact patient care.

what the hell is this

SHAP Values for Feature Importance

SHAP (SHapley Additive exPlanations) assigns each feature an importance value for a particular prediction, based on cooperative game theory. SHAP provides more rigorous explanations than simpler methods but takes longer to compute.

Reference Card: SHAP

Component Details
TreeExplainer(model) Explainer optimized for tree-based models
explainer.shap_values(X) Compute SHAP values for data
summary_plot() Visualize global feature importance
dependence_plot() Show feature interactions
Install pip install shap

Code Snippet: SHAP

import shap
import xgboost as xgb
import pandas as pd

X = pd.DataFrame({'age': [50, 60], 'bp': [120, 140]})
y = [0, 1]
model = xgb.XGBClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")

SHAP summary plot example

Example SHAP summary plot: Each dot shows a feature’s impact on a prediction. Color indicates feature value (red=high, blue=low).

SHAP dependence plot

Example SHAP dependence plot: Shows how the effect of one feature depends on the value of another feature.

eli5 for Model Inspection

eli5 is a Python library that helps demystify machine learning models by showing feature weights and decision paths.

Reference Card: eli5

Component Details
show_weights() Display feature importances (for notebooks/HTML)
explain_weights() Return explanation object for programmatic use (e.g. format_as_html())
show_prediction() Explain a single prediction
Key Parameters estimator: Trained model
feature_names: List of feature names
top: Number of top features to show
Install pip install eli5

Code Snippet: eli5

import eli5
from sklearn.ensemble import RandomForestClassifier

X = [[1, 2], [3, 4], [5, 6]]
y = [0, 1, 0]
model = RandomForestClassifier().fit(X, y)

eli5.show_weights(model, feature_names=['feature1', 'feature2'])

eli5 feature weights

eli5 prediction explanation

Practical Data Preparation

Preparing your data is just as important as choosing the right model. Good data prep can make or break your results—especially with real-world health data, which is often messy, imbalanced, and full of categorical variables.

Feature Scaling with StandardScaler

Many algorithms (especially logistic regression, SVMs, and neural networks) perform better when features are on similar scales. StandardScaler transforms features to have mean=0 and standard deviation=1.

Reference Card: StandardScaler

Component Details
Function sklearn.preprocessing.StandardScaler()
Purpose Standardize features by removing the mean and scaling to unit variance
Key Methods fit(X): Compute mean and std from training data
transform(X): Apply standardization
fit_transform(X): Fit and transform in one step
inverse_transform(X): Reverse the transformation
When to Use Logistic regression, SVM, neural networks, K-means, PCA

Code Snippet: StandardScaler

from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data with different scales
X = np.array([[50, 180000], [65, 220000], [35, 150000]])  # Age, Income

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Original:\n", X)
print("Scaled:\n", X_scaled)
# Each column now has mean≈0, std≈1

LabelEncoder for Target Variables

Some classifiers (like XGBoost) require target labels to be consecutive integers starting from 0. LabelEncoder transforms arbitrary labels like [5, 7, 9] into [0, 1, 2].

Reference Card: LabelEncoder

Component Details
Function sklearn.preprocessing.LabelEncoder()
Purpose Encode target labels as integers 0, 1, 2, …
Key Methods fit(y): Learn label mapping
transform(y): Apply encoding
fit_transform(y): Fit and transform in one step
inverse_transform(y): Convert back to original labels
When to Use XGBoost with multi-class labels, or when algorithms require 0-indexed labels

Code Snippet: LabelEncoder

from sklearn.preprocessing import LabelEncoder

# Original labels aren't 0-indexed
y = [5, 7, 9, 5, 7, 9]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
print(y_encoded)  # [0 1 2 0 1 2]

# To decode back:
y_original = encoder.inverse_transform(y_encoded)
print(y_original)  # [5 7 9 5 7 9]

OneHotEncoder for Categorical Variables

Many machine learning models require all input features to be numeric. One-hot encoding transforms categorical variables (like “smoker” or “blood type”) into a set of binary columns.

onehotencoder

Reference Card: OneHotEncoder

Component Details
Function sklearn.preprocessing.OneHotEncoder()
Purpose Encode categorical features as binary columns
Key Parameters categories: Categories per feature (‘auto’)
drop: Drop category to avoid multicollinearity
sparse_output: Return sparse matrix or array
handle_unknown: How to handle unknown categories

Code Snippet: OneHotEncoder

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({'smoker': ['yes', 'no', 'no', 'yes']})
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(df[['smoker']])
print(encoded)

Handling Imbalanced Classes with Evaluation Metrics

Consider metrics that penalize misclassifications unequally, like balanced accuracy, minority-class F1-score, precision and recall, or the Matthews correlation coefficient.
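To see why plain accuracy misleads on imbalanced data, compare it against balanced accuracy and F1 for a do-nothing model that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

y_true = [0] * 90 + [1] * 10  # 90/10 imbalance
y_pred = [0] * 100            # always predict the majority class

print(accuracy_score(y_true, y_pred))             # 0.9: looks great
print(balanced_accuracy_score(y_true, y_pred))    # 0.5: no better than chance
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0: minority class missed entirely
```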

Handling Imbalanced Data with SMOTE

In health data, one class (like “disease present”) is often much rarer than the other. SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic examples of the minority class to balance the dataset.

smote

Reference Card: SMOTE

Component Details
Function imblearn.over_sampling.SMOTE()
Purpose Oversample minority class with synthetic samples
Key Parameters sampling_strategy: Target class distribution
k_neighbors: Neighbors used for interpolation
random_state: For reproducibility
Install pip install imbalanced-learn

Code Snippet: SMOTE

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
import collections

X, y = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=42)
print("Original:", collections.Counter(y))

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("Resampled:", collections.Counter(y_resampled))

imbalance

Feature Engineering

Good features often matter more than the choice of algorithm. Feature engineering combines domain knowledge with data transformation to create inputs that help models learn.

Domain-Specific Feature Derivations

The best features often come from domain knowledge, knowing what matters in health data: think BMI derived from height and weight, pulse pressure from systolic and diastolic blood pressure, or time since last visit from appointment timestamps.
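A couple of common clinical derivations sketched in pandas (the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'height_m':  [1.70, 1.60],
    'weight_kg': [70, 80],
    'sbp':       [120, 150],  # systolic blood pressure
    'dbp':       [80, 95],    # diastolic blood pressure
})

df['bmi'] = df['weight_kg'] / df['height_m'] ** 2  # body mass index
df['pulse_pressure'] = df['sbp'] - df['dbp']       # systolic minus diastolic
print(df[['bmi', 'pulse_pressure']])
```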

Automated Feature Engineering (Further Reading)

For complex relational datasets, libraries like featuretools can automatically generate features using Deep Feature Synthesis. This is especially useful for time series and multi-table data.

When and How to Combine Techniques

Often, you’ll need to use several data prep techniques together. The order is crucial to prevent data leakage—when information from the test set accidentally influences training, leading to overly optimistic performance estimates.

Recommended Order (opinionated):

  1. Split data FIRST: Hold out the test set before any preprocessing. Use stratify=y to preserve class ratios.
  2. Encode categorical variables: Fit encoder only on training data, then transform both train and validation/test
  3. Engineer additional features:
    • Always use subject matter expertise to transform or combine conceptually related columns into features that may be more predictive or better behaved (e.g., more linear) for your model
    • Use dimension reduction techniques like Principal Component Analysis to combine columns for better predictive value and to reduce multicollinearity
    • Consider automated feature engineering as an exploratory step but beware that it is a kind of modeling and, just like any model, it is susceptible to overfitting. Also, you’re going to have to explain whatever you did and the relationships in your analysis.
  4. Scale/normalize features (if needed): Fit scaler only on training data, then transform both
  5. Balance classes (optional, e.g., SMOTE): Apply only to training set—never validation or test!

Why split first? If you fit your scaler or encoder on the full dataset before splitting, your model “sees” test data statistics during training. This leaks information and makes your model appear better than it really is.
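One way to get this ordering right automatically is sklearn's Pipeline: preprocessing steps are refit on each training fold only, so test-fold statistics never leak in. A sketch with a scaler and logistic regression on synthetic data:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, random_state=42)

pipe = Pipeline([
    ('scale', StandardScaler()),   # fit on training folds only
    ('clf', LogisticRegression()),
])

# cross_val_score refits the whole pipeline per fold: no leakage
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```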

If Not Balancing: Use evaluation metrics that account for imbalance, such as balanced accuracy, minority-class F1, or the area under the precision-recall curve.

LIVE DEMO