It's a hands-on learning exercise for understanding how attention, positional encoding, encoder-decoder stacks, and beam search work together.
The dataset is English-to-Moroccan-Darija by BounharAbdelaziz on Hugging Face, with ~16k parallel sentence pairs.
- 📄 Attention Is All You Need — Vaswani et al., 2017 (the original Transformer paper)
- 🎥 Coding a Transformer from Scratch — 3Blue1Brown
```mermaid
flowchart TB
    subgraph Input
        EN["English Sentence"]
        DA["Darija Sentence (shifted right)"]
    end
    subgraph Encoder["Encoder (×3 Layers)"]
        direction TB
        EE["Token Embedding + Positional Encoding"]
        EL1["Encoder Layer 1"]
        EL2["Encoder Layer 2"]
        EL3["Encoder Layer 3"]
        EN_NORM["Final LayerNorm"]
        EE --> EL1 --> EL2 --> EL3 --> EN_NORM
    end
    subgraph Decoder["Decoder (×3 Layers)"]
        direction TB
        DE["Token Embedding + Positional Encoding"]
        DL1["Decoder Layer 1"]
        DL2["Decoder Layer 2"]
        DL3["Decoder Layer 3"]
        DN_NORM["Final LayerNorm"]
        FC["Linear → Vocab"]
        DE --> DL1 --> DL2 --> DL3 --> DN_NORM --> FC
    end
    EN --> EE
    DA --> DE
    EN_NORM -- "Encoder Output" --> DL1
    EN_NORM -- "Encoder Output" --> DL2
    EN_NORM -- "Encoder Output" --> DL3
    FC --> OUT["Output Probabilities"]
```
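The "Token Embedding + Positional Encoding" boxes do not say which encoding the notebook uses. Assuming it is the fixed sinusoidal encoding from Vaswani et al., the table added to the embeddings could be built roughly as follows. This is a framework-free sketch in plain Python, not the notebook's code, and the name `sinusoidal_positions` is hypothetical.

```python
import math

def sinusoidal_positions(max_len, d_model):
    """Return a max_len x d_model table: sin on even columns, cos on odd ones."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Columns 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((i // 2 * 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Sizes taken from the hyperparameter table (MAX_LEN=256, D_MODEL=256).
pe = sinusoidal_positions(max_len=256, d_model=256)
```

Row `pe[t]` would be added element-wise to the embedding of the token at position `t` before the first encoder or decoder layer.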
```mermaid
flowchart TB
    X_IN["Input x"] --> LN1["LayerNorm"]
    LN1 --> SA["Multi-Head Self-Attention"]
    SA --> DROP1["Dropout (0.3)"]
    DROP1 --> ADD1(("+"))
    X_IN --> ADD1
    ADD1 --> LN2["LayerNorm"]
    LN2 --> FFN["FFN (256 → 512 → 256)"]
    FFN --> DROP2["Dropout (0.3)"]
    DROP2 --> ADD2(("+"))
    ADD1 --> ADD2
    ADD2 --> X_OUT["Output"]
    style ADD1 fill:#4CAF50,color:#fff
    style ADD2 fill:#4CAF50,color:#fff
```
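The encoder layer diagram follows the Pre-LN pattern: normalize first, run the sublayer, then add the result back to the unnormalized input. A minimal plain-Python sketch of that data flow for a single token vector is shown here. It makes several simplifying assumptions: LayerNorm has no learnable gain or bias, dropout is left out (it is the identity in eval mode), and `self_attention` and `ffn` are placeholder callables rather than the notebook's modules.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_block(x, sublayer):
    """Compute x + sublayer(LayerNorm(x)); dropout omitted."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]

def encoder_layer(x, self_attention, ffn):
    """Two Pre-LN residual blocks in sequence, as in the diagram."""
    x = pre_ln_block(x, self_attention)
    return pre_ln_block(x, ffn)
```

Because the residual path never passes through LayerNorm, the input reaches the output unchanged whenever the sublayers contribute nothing, which is one reason Pre-LN stacks tend to be easier to train.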
```mermaid
flowchart TB
    X_IN["Input x"] --> LN1["LayerNorm"]
    LN1 --> MSA["Masked Self-Attention"]
    MSA --> DROP1["Dropout (0.3)"]
    DROP1 --> ADD1(("+"))
    X_IN --> ADD1
    ADD1 --> LN2["LayerNorm"]
    LN2 --> CA["Cross-Attention (Q=dec, K/V=enc)"]
    CA --> DROP2["Dropout (0.3)"]
    DROP2 --> ADD2(("+"))
    ADD1 --> ADD2
    ADD2 --> LN3["LayerNorm"]
    LN3 --> FFN["FFN (256 → 512 → 256)"]
    FFN --> DROP3["Dropout (0.3)"]
    DROP3 --> ADD3(("+"))
    ADD2 --> ADD3
    ADD3 --> X_OUT["Output"]
    ENC_OUT["Encoder Output"] -.-> CA
    style ADD1 fill:#4CAF50,color:#fff
    style ADD2 fill:#4CAF50,color:#fff
    style ADD3 fill:#4CAF50,color:#fff
    style ENC_OUT fill:#2196F3,color:#fff
```
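The "Masked Self-Attention" step in the decoder layer keeps each target position from looking at later positions. One common way to do this, sketched here in plain Python with hypothetical helper names, is to set disallowed scores to negative infinity before the softmax so that they receive zero weight.

```python
import math

def causal_mask(size):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(size)] for i in range(size)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores, giving zero weight to masked positions."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
first_row = masked_softmax([1.0, 2.0, 3.0], mask[0])
```

Here `first_row` puts all of its weight on position 0, since the first target token has nothing earlier to attend to.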
```mermaid
flowchart LR
    Q["Q"] --> WQ["W_q Linear"]
    K["K"] --> WK["W_k Linear"]
    V["V"] --> WV["W_v Linear"]
    WQ --> SPLIT_Q["Split into 4 Heads"]
    WK --> SPLIT_K["Split into 4 Heads"]
    WV --> SPLIT_V["Split into 4 Heads"]
    SPLIT_Q --> ATTN["Scaled Dot-Product\nAttention (×4)"]
    SPLIT_K --> ATTN
    SPLIT_V --> ATTN
    ATTN --> CONCAT["Concat Heads"]
    CONCAT --> WO["W_o Linear"]
    WO --> OUT["Output"]
```
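The two central steps of the multi-head attention diagram, splitting a projected vector into heads and running scaled dot-product attention inside one head, can be sketched in plain Python. This is an illustration only: the `W_q`, `W_k`, `W_v`, and `W_o` projections are omitted, and the function names are hypothetical.

```python
import math

def split_heads(vec, n_head):
    """Cut one d_model-sized vector into n_head equal chunks (256 -> 4 x 64)."""
    head_dim = len(vec) // n_head
    return [vec[h * head_dim:(h + 1) * head_dim] for h in range(n_head)]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head; each argument is a list of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors, one output dimension at a time.
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

heads = split_heads(list(range(256)), n_head=4)
```

In the full layer, the per-head outputs would be concatenated back to 256 dimensions and passed through `W_o`.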
```mermaid
flowchart LR
    CSV["CSV Dataset\n~16k pairs"] --> CLEAN["Clean\n• lowercase\n• filter ≥50 words"]
    CLEAN --> TOK["BPE Tokenizer\nvocab=5000"]
    TOK --> DL["DataLoader\nbatch=32"]
    DL --> MODEL["Tiny Transformer\n3L / 256d / 512ff"]
    MODEL --> LOSS["CrossEntropyLoss\nlabel_smoothing=0.1"]
    LOSS --> OPT["Adam + OneCycleLR\nmax_lr=0.0007"]
    OPT --> AMP["Mixed Precision\nGradScaler"]
    AMP -->|20 epochs| MODEL
    style CLEAN fill:#FF9800,color:#fff
    style MODEL fill:#9C27B0,color:#fff
    style AMP fill:#2196F3,color:#fff
```
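To make the `label_smoothing=0.1` setting in the loss step concrete, here is a plain-Python sketch of smoothed cross-entropy for a single prediction. It follows the usual definition, in which the one-hot target is mixed with a uniform distribution over all classes. The function name is hypothetical, and the notebook itself relies on PyTorch's `CrossEntropyLoss` rather than code like this.

```python
import math

def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against (1 - smoothing) * one_hot + smoothing / K."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    log_probs = [v - log_z for v in logits]
    k = len(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # Every class gets smoothing / K; the true class gets the rest as well.
        weight = smoothing / k + (1.0 - smoothing if i == target else 0.0)
        loss -= weight * lp
    return loss
```

With smoothing above zero the model is never rewarded for pushing the true token's probability all the way to 1, which discourages overconfident predictions on a small dataset.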
```mermaid
flowchart TB
    IN["English Input"] --> ENC["Encode (frozen)"]
    ENC --> BEAM["Beam Search (k=3)"]
    BEAM --> B1["Beam 1: score=-1.2"]
    BEAM --> B2["Beam 2: score=-1.5"]
    BEAM --> B3["Beam 3: score=-2.1"]
    B1 --> EXPAND["Expand top-k tokens\nper beam"]
    B2 --> EXPAND
    B3 --> EXPAND
    EXPAND --> PRUNE["Keep top 3 beams\nby cumulative log-prob"]
    PRUNE -->|"repeat until </s>"| BEAM
    PRUNE --> BEST["Best Translation"]
    style BEST fill:#4CAF50,color:#fff
```
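The expand-and-prune loop in the beam search diagram can be sketched without any model at all. In the plain-Python example below, `next_log_probs` is a hypothetical stand-in for a decoder step: it maps a token prefix to a dictionary of next-token log-probabilities. The sketch ranks hypotheses by raw cumulative log-probability, as in the diagram, and applies no length normalization. It is not the notebook's `translate_beam`.

```python
import math

def beam_search(next_log_probs, bos, eos, beam_size=3, max_len=20):
    beams = [(0.0, [bos])]  # (cumulative log-prob, tokens)
    finished = []
    for _ in range(max_len):
        # Expand every live beam by every candidate next token.
        candidates = []
        for score, tokens in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((score + logp, tokens + [token]))
        # Prune: keep only the best beam_size hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((score, tokens))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])[1]

def toy_model(tokens):
    """Hand-written distribution in which the greedy first choice is a trap."""
    last = tokens[-1]
    if last == "<s>":
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if last == "a":
        return {"x": math.log(0.5), "y": math.log(0.5)}
    if last == "b":
        return {"</s>": math.log(0.9), "z": math.log(0.1)}
    return {"</s>": 0.0}

best = beam_search(toy_model, "<s>", "</s>", beam_size=3)
greedy = beam_search(toy_model, "<s>", "</s>", beam_size=1)
```

In this toy setup, `greedy` commits to `a` and ends with total probability 0.3, while `best` keeps `b` alive and finds the 0.36 path, which is the kind of case where a beam wider than 1 helps.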
| Parameter | Value |
|---|---|
| `D_MODEL` | 256 |
| `D_FF` | 512 |
| `N_HEAD` | 4 (64 dims/head) |
| `NUM_LAYERS` | 3 |
| `dropout` | 0.3 |
| `VOCAB_SIZE` | 5000 (shared BPE) |
| `MAX_LEN` | 256 |
| LayerNorm | Pre-LN |
| Setting | Value |
|---|---|
| Optimizer | Adam (β₁=0.9, β₂=0.98) |
| Scheduler | OneCycleLR |
| `max_lr` | 0.0007 |
| `label_smoothing` | 0.1 |
| Epochs | 20 |
| Batch Size | 32 |
| Mixed Precision | FP16 via `torch.amp` |
| Gradient Clipping | 1.0 |
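The table does not say whether the gradient clipping value of 1.0 is a norm or a per-element limit. Assuming it is a global-norm clip, which is the common choice for Transformers, the operation amounts to the following plain-Python sketch. The function name is hypothetical, and each parameter's gradient is represented as a flat list.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Small epsilon avoids division by zero; the scale is never above 1.
    scale = min(1.0, max_norm / (total + 1e-6))
    return [[g * scale for g in vec] for vec in grads]

clipped = clip_by_global_norm([[3.0], [4.0]])  # combined norm 5 before clipping
```

Gradients whose combined norm is already below the threshold pass through unchanged, so the clip only acts on occasional spikes.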
- Open `Transformer.ipynb` in Google Colab (T4 GPU)
- Upload `train-00000-of-00001.csv` to the runtime
- Run all cells sequentially (Cell 1 → 6)
- Test translations in Cell 5:

```python
translate_beam("How are you?", model, tokenizer, device)
```

```
├── Transformer.ipynb              # Main notebook (6 cells)
├── train-00000-of-00001.csv       # English-Darija dataset (~16k rows)
├── tokenizer/                     # Saved BPE tokenizer files
│   ├── vocab.json
│   └── merges.txt
├── model.pth
└── README.md                      # This file
```