Transformer Architecture

Pre-training and scaling laws

Model architectures and pre-training objectives

Scaling laws and compute-optimal models