Transformers: More than Meets the Eye

hw07 #FIXME:URL

From Neural Networks to Transformers

The Scale-Up Era (2018–)

ELMo (2018): contextualized word embeddings — same word, different vector depending on context
BERT (2018): bidirectional training; dominated NLP benchmarks
GPT (2018): unidirectional, predict-the-next-token — the design behind all modern LLMs
T5 (2019): every NLP task as text-to-text
GPT-3 (2020): 175B parameters; few-shot learning from examples in the prompt alone
ChatGPT (2022): GPT-3.5 + RLHF; 100M users in two months
Open-weight models (2023–): Llama, Mistral — competitive models you can run locally
Reasoning models (2024–): o1/o3, DeepSeek-R1 — chain-of-thought at inference time
Agentic AI (2025–): models that use tools, write code, and orchestrate multi-step workflows

Transformer Architecture

The Problem: Processing Everything at Once

Transformers process the full sequence in parallel, but that creates a new problem: how does any token know about any other token? The answer is attention.

The original transformer uses an encoder-decoder structure:

Encoder: reads the entire input, builds a rich representation
Decoder: uses that representation to generate output one token at a time
Both are stacks of 6 identical layers (same structure, different learned weights)
Pipeline: Tokenize → Embed → Add positional encodings → Stack attention layers → Generate output

Modern LLMs have largely converged on a decoder-only design. It turns out you don’t need a separate “understanding” step. Instead of encode-then-decode, concatenate everything: context, question, partial answer. Then train a single decoder stack to predict the next token.

Self-Attention: Letting Tokens Talk

“The animal didn’t cross the street because it was too tired.” — what does “it” refer to?
“The doctor told the nurse that she would handle the next patient after she finished the paperwork.” Which “she” is which?
“The trophy didn’t fit in the suitcase because it was too big.” — is “it” the trophy or the suitcase?

These ambiguities are trivial(ish) for humans but require the model to weigh every token’s relationship to every other token simultaneously. Self-attention does exactly that — each token computes how much it should attend to every other token, resolving these references in a single step.

How It Works: Query, Key, Value

For each token, the model creates three vectors from learned weight matrices:

Query (Q): What this token is looking for
Key (K): What this token offers to others
Value (V): The actual content to retrieve

For a 3-token input — “cat,” “sat,” “mat” — each embedding is multiplied by learned weight matrices $W_Q$, $W_K$, $W_V$ to produce Q, K, V vectors. From the perspective of “cat”:

Score: Compute the dot product of $Q_\text{cat}$ against every token’s Key:
- $Q_\text{cat} \cdot K_\text{cat} = 112$, $Q_\text{cat} \cdot K_\text{sat} = 96$, $Q_\text{cat} \cdot K_\text{mat} = 78$
Scale: Divide by $\sqrt{d_k} = \sqrt{4} = 2$: scores become $56, 48, 39$
Softmax (convert scores to probabilities summing to 1): $[0.73, 0.22, 0.05]$ — “cat” attends mostly to itself and somewhat to “sat”
Weighted sum: $0.73 \cdot V_\text{cat} + 0.22 \cdot V_\text{sat} + 0.05 \cdot V_\text{mat}$ — a new representation of “cat” that blends information from the whole sequence

Repeat for every token. That’s self-attention.

Code Snippet: Simplified Attention

The function below implements the core attention calculation in pure numpy. It takes query, key, and value matrices, computes scaled dot-product scores between all pairs of tokens, normalizes them with softmax to get attention weights, then uses those weights to blend the value vectors into context-aware representations.

import numpy as np

def scaled_dot_product_attention(query, key, value):
    """Compute scaled dot-product attention (pure numpy)."""
    d_k = query.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return weights @ value

Multi-Head Attention

Language has many simultaneous relationships — syntax, semantics, entity references, temporal ordering. Multi-head attention runs multiple attention operations in parallel, each with its own learned Q/K/V matrices, so each head can specialize.

8 heads, 512-dim embeddings → 64 dims per head
Results are concatenated and projected back to full dimension

Putting It Together

How training works:

The encoder reads the source sequence; the decoder generates the target one token at a time.
Cross-attention bridges the two — in the architecture diagram, it’s the middle attention block in each decoder layer where the decoder’s queries attend to the encoder’s keys and values.

This is how the decoder “reads” the input: it asks “given what I’ve generated so far, what parts of the input should I focus on next?” Cross-entropy loss measures prediction error, gradients flow back, and the Adam optimizer updates weights.

Repeat over billions of examples…

Reference Card: Transformer Components

Component	What Problem It Solves	Details
Input Embedding	Discrete tokens → continuous space	Maps each token to a dense vector the network can process
Positional Encoding	Attention is order-agnostic	Injects position information so the model can distinguish word order
Multi-Head Attention	Single attention can’t specialize	Each head focuses on different aspects (syntax, semantics, entity references)
Cross-Attention	Decoder needs to read the input	Decoder queries attend to encoder keys/values — “what did the input say?”
Feed-Forward Network	Attention blends but can’t transform	Two-layer network (expand 4x, activate, contract) applied at each position
Layer Normalization	Deep networks have unstable signals	Rescale activations to mean=0, variance=1 within each layer
Residual Connections	Deep networks have vanishing gradients	Skip connections create gradient highways through the full stack
Masking	Decoder can’t peek at future tokens	Sets future positions to $-\infty$ before softmax

Beyond Text

Vision Transformers (ViT): images split into patches, each patch treated as a token
Time-series: EHR data, sensor readings, financial sequences
Protein structure: AlphaFold uses attention over amino acid sequences
Multimodal models: GPT-4o, Gemini, Claude process text, images, and audio together
Clinical EHR modeling: sequences of diagnosis codes, medications, and lab values over time — each event is a “token” and attention learns which prior events matter for predicting outcomes
Code and version history: git diffs, code completion (Copilot, Cursor), and automated code review all use transformer architectures
Music and audio: Whisper (speech-to-text), Jukebox (music generation) treat audio spectrograms as sequences

Building a GPT from Scratch

A working GPT in ~200 lines of Python — Karpathy’s microGPT.

Band	What It Does
Autograd Engine (orange)	Gradient-tracking machinery that powers backpropagation
Input	Raw text → characters → integer token IDs
Embeddings	Token embedding + position embedding (input embedding + positional encoding)
Normalization	Layer norm (RMSNorm) — the “Add & Norm” pattern
Transformer Block (×`n_layer`)	Multi-head self-attention (4 heads × 16 dims) → MLP (feed-forward) with residual connections
Output Head	Linear projection from embedding dim → vocabulary size (27 chars)
Prediction	Softmax → next-token probabilities
Training	Cross-entropy loss (how wrong?) → backprop → Adam optimizer updates weights
Inference	Sample from probability distribution; temperature controls randomness

Scaling to GPT-4 changes the tokenizer, the data (terabytes), and the compute (thousands of GPUs) — but the core algorithm is the same.

LIVE DEMO!

Embeddings

Embeddings map discrete tokens to continuous vectors where meaning is geometry. Similar items cluster together; relationships become directions. Every layer of a transformer produces embeddings — they’re the model’s internal representation of meaning. LLMs like GPT-4 produce rich, high-dimensional embeddings internally, but for practical tasks like search and comparison we typically use smaller, purpose-built models (like Sentence Transformers) because their embeddings are compact enough to store and compare at scale. These representations emerge through training in a self-organizing, unsupervised manner — no one labels which words should be near each other; the geometry arises from patterns in the data.

The idea generalizes beyond text — recommendation systems, drug interactions, diagnostic codes, and categorical variables can all be embedded.

Key applications: semantic search, document clustering, similarity matching, anomaly detection, classification features.

Reference Card: Common Embedding Methods

Method	Type	Key Characteristic
Word2Vec	Word-level, static	Learned from co-occurrence; fast to train
GloVe (Global Vectors)	Word-level, static	Factorizes co-occurrence matrix; similar to Word2Vec
FastText	Subword-level, static	Character n-grams handle misspellings and rare words
Sentence Transformers	Sentence-level, contextual	Same word gets different vectors by context; purpose-built for similarity

Sentence Transformers

Sentence Transformers produce fixed-size vectors for full sentences — contextualized embeddings where “bank” near “river” gets a different vector than “bank” near “money.”

Reference Card: `SentenceTransformer`

Component	Details
Library	`sentence-transformers` (`pip install sentence-transformers`)
Purpose	Generate dense vector embeddings for sentences/paragraphs
Key Method	`model.encode(sentences)` — returns numpy array of embeddings
Popular Models	`all-MiniLM-L6-v2` (fast), `all-mpnet-base-v2` (accurate)
Output	Fixed-size vectors (e.g., 384 or 768 dimensions)

Code Snippet: SentenceTransformer

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "Patient presents with chest pain",
    "Acute myocardial infarction suspected",
    "Scheduled for routine dental cleaning",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384) — three sentences, 384 dimensions each

Cosine Similarity

Measures the angle between two vectors, ignoring magnitude.

Reference Card: `cosine_similarity`

Component	Details
Function	`sklearn.metrics.pairwise.cosine_similarity()`
Purpose	Measure similarity between vectors (1 = identical, 0 = orthogonal, -1 = opposite)
Input	Two arrays of shape (n_samples, n_features)
Use Case	Compare embeddings to find semantically similar texts

Code Snippet: Computing and Comparing Embeddings

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

# Clinical documents
docs = [
    "Patient presents with chest pain and shortness of breath",
    "Lab results show elevated troponin levels",
    "Patient reports headache and nausea",
]

embeddings = model.encode(docs)

# Find most similar to a query
query_emb = model.encode(["cardiac symptoms"])
similarities = cosine_similarity(query_emb, embeddings)[0]

for doc, sim in sorted(zip(docs, similarities), key=lambda x: -x[1]):
    print(f"{sim:.3f}  {doc}")

Vector Databases

Stores and indexes embedding vectors for fast similarity search at scale.

Reference Card: Vector Database Options

Database	Type	Strengths
ChromaDB	In-memory/persistent	Simple API, good for prototyping
FAISS (Facebook AI Similarity Search)	In-memory	Fast, scalable, from Meta AI
Pinecone	Cloud service	Managed, production-ready
Weaviate	Self-hosted/cloud	Full-text + vector search
pgvector	PostgreSQL extension	Integrate with existing DB

Code Snippet: Vector Database with ChromaDB

import chromadb

client = chromadb.Client()
collection = client.create_collection("clinical_notes")

# Add documents (ChromaDB handles embedding automatically)
collection.add(
    documents=["Patient has type 2 diabetes", "Elevated troponin, chest pain"],
    ids=["note1", "note2"]
)

# Query by semantic similarity
results = collection.query(query_texts=["cardiac symptoms"], n_results=1)
print(results["documents"])  # [['Elevated troponin, chest pain']]

General Models → Getting the Details Right

LLMs are general-purpose — the same model translates, summarizes, classifies, writes code, and reasons. No custom pipeline needed per task. Open-source and open-weight models (Llama, Mistral, DeepSeek) now match or exceed what was state-of-the-art just a year ago — models that would cost millions to train from scratch are freely available as starting points. The practical question isn’t “how do I build a model?” but “how do I get an existing model to do what I need?”

Two approaches to go from a general model to your specific task:

Approach	When to Use	Effort	Cost
Prompting (recommended default)	Most tasks; fast iteration	Minutes to test	Lower
Fine-tuning (specialized cases)	Specialized vocabulary, domain patterns	Days–weeks	Higher

Fine-Tuning

Continue training a pre-trained model on your domain data. Save it for specialized vocabulary or patterns (e.g., pathology report terminology) where you have hundreds+ labeled examples.

Reference Card: `Trainer`

Component	Details
Signature	`Trainer(model, args, train_dataset, eval_dataset=None, data_collator=None)`
Purpose	High-level training loop that handles batching, optimization, logging, and checkpointing for fine-tuning pre-trained models.
Parameters	• model: A pre-trained `AutoModel` instance (e.g., `GPT2LMHeadModel`). • args (`TrainingArguments`): Configures output dir, epochs, batch size, learning rate, etc. • train_dataset (`Dataset`): Tokenized training data in Hugging Face `Dataset` format. • eval_dataset (`Dataset`, optional): Evaluation data for metrics during training.
Returns	`TrainOutput` with training loss and metrics. Call `trainer.train()` to start.

Code Snippet: Fine-Tuning a GPT

from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import Dataset

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Tokenize and wrap in a Dataset (Trainer requires this format)
texts = ["Clinical notes about diabetes management", "More clinical text about hypertension"]
tokenized = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
tokenized["labels"] = tokenized["input_ids"].clone()
dataset = Dataset.from_dict({k: v.tolist() for k, v in tokenized.items()})

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
)

trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()

Making Fine-Tuning Practical

Full fine-tuning updates every weight in the model — expensive and often unnecessary. Several strategies reduce cost and tailor the model to your domain:

Layer freezing: Lock early layers (which learn general language features) and only train the later, task-specific layers. Fewer trainable parameters means faster training, less memory, and lower overfitting risk on small datasets.
Head replacement: Remove the final layers (closest to the output) from a pre-trained model and replace them with fresh layers trained on your domain data. The frozen base acts as a feature extractor — it already “understands” language — while the new output layers learn your specific task. This is the most common transfer-learning pattern in practice.
Adapter methods (LoRA, QLoRA): Insert small trainable modules into the frozen model. LoRA adds low-rank weight matrices (~1–5% of original parameters) that learn the domain-specific “delta.” The base model stays untouched, so you can swap adapters for different tasks without retraining from scratch.
Pruning: Remove redundant weights or attention heads after training to shrink the model for deployment. Useful when you need inference speed on limited hardware without sacrificing much accuracy.

In practice, most teams start with prompting, move to head replacement or LoRA if needed, and rarely do full fine-tuning unless they have substantial compute and data.

Hallucination

No general solution. The model confidently generates plausible-sounding text that may be completely wrong.

Mitigations (none foolproof):

RAG (Retrieval-Augmented Generation): ground responses in actual documents (Lecture 8)
Prompt and output design: structured outputs, schema enforcement, require citations
Human-in-the-loop: expert review, especially for high-stakes decisions

LIVE DEMO!!

Prompt Engineering

“Programming” the model without retraining. Every prompt has the same building blocks:

[ROLE] Who the model should act as [TASK] What needs to be done [FORMAT] How to structure the output [CONSTRAINTS] Boundaries and requirements [EXAMPLES] Concrete input/output pairs

Reference Card: Prompting Techniques

Technique	Description	When to Use
Zero-shot	Task description only, no examples	Simple, well-defined tasks
One-shot	Single example provided	When pattern is clear from one case
Few-shot	2–5 examples provided	Complex patterns, structured output
Chain-of-thought	Ask model to show reasoning step-by-step before answering	Multi-step reasoning tasks (expanded in Lecture 8)
Explicit structure	Use XML tags or numbered steps to separate prompt components	Complex prompts with multiple data sources
Grounding	Ask the model to extract relevant quotes before answering	Clinical decision support, traceability required
Self-verification	Ask the model to check its own output before finishing	Structured extraction, high-stakes tasks
Document ordering	Place documents at top, questions at bottom	Multi-document analysis (20K+ tokens)

Zero-Shot, One-Shot, and Few-Shot Learning

Zero-shot: task description only, no examples — works for simple, well-defined tasks
One-shot: single example establishes the pattern
Few-shot: 2–5 examples — needed for complex output formats or domain-specific conventions

The more structured the task, the more examples help.

Example: Few-Shot Prompting

Extract diagnoses from clinical notes.

Example 1: Note: “Patient presents with elevated blood glucose and polyuria.” Diagnosis: Type 2 Diabetes Mellitus

Example 2: Note: “Chest pain radiating to left arm, elevated troponin.” Diagnosis: Acute Myocardial Infarction

Now extract the diagnosis: Note: “Patient has persistent cough, fever, and infiltrates on chest X-ray.” Diagnosis:

System Prompts

Sets the model’s persona, constraints, and default behavior for the entire conversation. System prompts are sent as a separate message role that persists across the conversation.

Example: System Prompt

You are a clinical documentation assistant.

Rules:

Use ICD-10 codes when identifying diagnoses

Flag any findings that need follow-up

Never provide treatment recommendations

Explicit Structure and Grounding

For complex prompts with multiple inputs, use XML tags or clear section markers to separate components. This reduces errors when the model needs to handle instructions, data, and formatting rules simultaneously.

Ask the model to extract and cite relevant quotes from the source material before generating its answer — this “grounds” the response in evidence and reduces hallucination.

Example: Structured Prompt with Grounding

<instructions> Review the clinical note below. First, extract key quotes that support your assessment. Then provide a structured diagnosis. </instructions>

<clinical_note> 65-year-old male with chest pain, ST elevation in leads V1-V4, troponin elevated at 2.5 ng/mL. Cardiology consulted for emergent catheterization. </clinical_note>

<output_format>

Supporting quotes from the note

Primary diagnosis with ICD-10 code

Confidence level (high/medium/low)

</output_format>

Self-Verification and Chain-of-Thought

Ask the model to reason step-by-step before answering (chain-of-thought), or to check its own output before finishing (self-verification). Both improve accuracy on multi-step reasoning tasks.

Example: Chain-of-Thought with Self-Verification

Review this patient’s medication list for interactions. Think through each pair step by step. After completing your analysis, verify that you checked every combination and didn’t miss any.

Medications: metformin, lisinopril, warfarin, aspirin, omeprazole

Prompt Chaining

Break complex tasks into sequential steps where each prompt’s output feeds into the next.

Step 1: Extract medications from clinical note → list Step 2: For each medication, check for interactions → table Step 3: Summarize findings for clinician → report

This is the foundation of agentic workflows (Lecture 8).

Structured Responses

Machine-readable output (JSON, XML, table) instead of free text. Specify the schema in the prompt, validate programmatically.

Reference Card: Structured Output Prompting

Component	Details
Schema Definition	Explicitly define JSON structure in prompt
Required Fields	List all mandatory fields with types
Validation	Parse and validate output programmatically
Fallback	Handle parsing errors gracefully

Example: Schema-Based Prompt

Extract the following information from the clinical note and return it as JSON:
{
  "diagnosis": "<primary diagnosis>",
  "confidence": "<0.0-1.0>",
  "icd_code": "<ICD-10 code if known>",
  "reasoning": "<brief explanation>"
}
Clinical Note: “65-year-old male with chest pain, ST elevation in leads V1-V4, troponin elevated at 2.5 ng/mL. Cardiology consulted for emergent catheterization.”

LLM API Integration

API Access Patterns

REST APIs: HTTP endpoints accepting JSON, returning generated text
SDKs: OpenAI Python, Anthropic SDK; OpenAI-compatible providers (OpenRouter, Together) reuse the same SDK with a different base_url
Authentication: API keys in environment variables or a secrets manager

Code Snippet: OpenAI API

from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env var

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user", "content": "Summarize: Patient presents with chest pain and elevated troponin."}
    ],
    max_tokens=150
)

print(response.choices[0].message.content)

Code Snippet: OpenRouter (OpenAI-Compatible)

Same openai SDK, different base_url — access models from every major provider.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",  # or "openai/gpt-4o-mini", etc.
    messages=[
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user", "content": "Summarize: Patient presents with chest pain and elevated troponin."}
    ],
    max_tokens=150
)

print(response.choices[0].message.content)

Links

Transformers & Attention

Building GPTs

LLMs

Healthcare AI

Prompt Engineering Guides

Where to Play Around