October 2018 · T1

BERT Released — Bidirectional Transformer Pretraining

Jacob Devlin and colleagues at Google AI posted BERT (Bidirectional Encoder Representations from Transformers) to arXiv. By pretraining a Transformer encoder bidirectionally as a masked language model, it set new state-of-the-art scores on GLUE and other NLP benchmarks at a stroke, and cemented the 'pretrain-then-fine-tune' paradigm that underlies modern LLMs. Alongside the autoregressive GPT family, it forms one of the two great currents of Transformer-based language modelling.

[Figure: BERT embeddings architecture diagram. Source: Daniel Voigt Godoy (Wikimedia Commons) · CC BY 4.0]

Metadata

Date: October 2018
Decade: 2010s
Tier: T1
Sources: 05
Connections: 01

BERT Released — Bidirectional Transformer Pretraining Rewrites NLP

On 11 October 2018, Jacob Devlin and colleagues at Google AI posted "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" to arXiv (arXiv:1810.04805).

The paper reported new state-of-the-art results on eleven NLP tasks at once. The GLUE score rose to 80.5% (+7.7 points), SQuAD v1.1 F1 reached 93.2 (+1.5 points, and above the human score of 91.2), and SQuAD v2.0 F1 reached 83.1 (+5.1 points). The NLP leaderboards were rewritten overnight.

What BERT Is

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only model built on the Transformer architecture introduced in 2017 (Vaswani et al.). It is specialised for turning input text into rich, context-dependent vector representations.

Two technical contributions stand out.

1. Bidirectional pretraining. GPT, which had appeared earlier the same year, was a left-to-right autoregressive model: when processing any given word, it could not see context to the right. BERT instead used masked language modelling (MLM): roughly 15% of input tokens are selected for prediction, most of them replaced by [MASK] (80% of the selected tokens; 10% are swapped for a random token and 10% are left unchanged), and the model is trained to predict the originals from the surrounding context. Each token's representation therefore draws on both left and right context simultaneously.

2. The "pretrain-then-fine-tune" paradigm. BERT was pretrained on BookCorpus (≈800M words) and English Wikipedia (≈2.5B words), then adapted to downstream tasks (question answering, sentiment analysis, classification) with only light fine-tuning, typically by adding a single task-specific output layer. The era of designing a bespoke architecture per task ended.
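The masking step in item 1 can be sketched in a few lines of Python. This is a minimal illustration, not the original training code. It follows the recipe described in the BERT paper, in which a selected token is replaced by [MASK] 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. The function name mask_tokens, the toy vocabulary, and the per-token coin flip used to select positions are assumptions made for this example; real implementations work on WordPiece token IDs.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}  # never selected for prediction


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted, labels) for one masked-language-modelling example.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else. mask_tokens is a hypothetical helper name.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Simplification: an independent coin flip per token, so the share
        # of selected tokens is only approximately select_prob.
        if token in SPECIAL_TOKENS or rng.random() >= select_prob:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


if __name__ == "__main__":
    sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
    toy_vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
    corrupted, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
    print(corrupted)
    print(labels)
```

The training loss is computed only at positions where labels is not None, so the model is never scored on tokens it could simply copy from the input.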

Two sizes shipped: BERT-Base (110M parameters, 12 layers) and BERT-Large (340M parameters, 24 layers). Large for the time, but roughly 500 times (BERT-Large) to 1,600 times (BERT-Base) smaller than GPT-3 (175B) two years later.
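The two parameter counts can be roughly reproduced from the architecture. The sketch below is a back-of-the-envelope estimate under stated assumptions: hidden sizes of 768 and 1024, a 30,522-entry WordPiece vocabulary, 512 positions, two segment types, and a feed-forward layer four times the hidden size. None of these values appear in the text above; they are the commonly published settings for the released uncased English checkpoints. The function name bert_parameter_count is hypothetical.

```python
def bert_parameter_count(layers, hidden, vocab_size=30522, max_positions=512,
                         segment_types=2, ffn_multiplier=4):
    """Estimate the number of weights and biases in a BERT-style encoder."""
    layer_norm = 2 * hidden  # one scale vector and one shift vector

    # Token, position and segment embedding tables, plus one LayerNorm.
    embeddings = (vocab_size + max_positions + segment_types) * hidden + layer_norm

    # Self-attention: query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)

    # Position-wise feed-forward block: expand, then project back.
    ffn_hidden = ffn_multiplier * hidden
    feed_forward = (hidden * ffn_hidden + ffn_hidden) + (ffn_hidden * hidden + hidden)

    per_layer = attention + feed_forward + 2 * layer_norm

    # Dense "pooler" layer applied to the [CLS] position.
    pooler = hidden * hidden + hidden

    return embeddings + layers * per_layer + pooler


base = bert_parameter_count(layers=12, hidden=768)
large = bert_parameter_count(layers=24, hidden=1024)
gpt3 = 175e9  # figure quoted in the text

print(f"BERT-Base  ~{base / 1e6:.0f}M parameters, GPT-3 is ~{gpt3 / base:.0f}x larger")
print(f"BERT-Large ~{large / 1e6:.0f}M parameters, GPT-3 is ~{gpt3 / large:.0f}x larger")
```

Under these assumptions the estimates land close to the rounded 110M and 340M figures, and the gap to GPT-3 comes out at a few hundred times for BERT-Large and somewhat over a thousand times for BERT-Base.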

Industry Response

One month after the paper, in November 2018, Google open-sourced the pretrained weights and TensorFlow code on GitHub. Open weights let researchers and companies fine-tune on their own data, and the NLP field moved almost overnight.

Derivatives poured out: RoBERTa (Facebook, July 2019, improved training recipe), ALBERT (Google, September 2019, parameter sharing), DistilBERT (Hugging Face, October 2019, 40% smaller and 60% faster), XLNet, ELECTRA, plus multilingual mBERT, Japanese BERT, biomedical BioBERT, legal LEGAL-BERT, and dozens more. "BERT-family" became standard vocabulary in NLP research.

In October 2019, Google announced that BERT was now running in production search. Initially it affected about 10% of English queries (roughly 560 million queries per day). Google itself called it "the biggest leap forward in the past five years" for Search. By December the rollout extended to 70 languages, and within a year BERT was being used on essentially every English query.

"Research to production in a year" was, for the time, extraordinary. It showed a direct pipeline from NLP academia to a service used by billions every day.

Why It Mattered

Before BERT, NLP was task-specific: separate architectures for translation, question answering, sentiment, parsing. The dominant practice was bespoke design plus task-specific data.

BERT demonstrated that a single large general-purpose model could be pretrained once and then fine-tuned cheaply for a wide range of tasks. That design philosophy is the foundation of modern LLMs such as GPT-3, ChatGPT, Claude, and Gemini, and it took hold in 2018.

The contrast with GPT (OpenAI) is also instructive. GPT is decoder-only (strong at generation); BERT is encoder-only (strong at understanding). The field later passed through encoder-decoder designs (T5, BART) before settling on decoder-only plus scale as the dominant line. But from 2018 to 2022, BERT-family models were the protagonists of NLP.
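In practice, the difference between the two designs comes down largely to which positions each token is allowed to attend to. The following is a minimal sketch of that idea, with plain Python lists standing in for the tensors a real model would use. The helper names attention_mask and visible_context are hypothetical and chosen only for this illustration.

```python
def attention_mask(seq_len, causal):
    """Build a mask where mask[i][j] is True if position i may attend to j.

    causal=True  gives the left-to-right (decoder-style, GPT-like) mask.
    causal=False gives the fully bidirectional (encoder-style, BERT-like) mask.
    """
    return [[(j <= i) or not causal for j in range(seq_len)]
            for i in range(seq_len)]


def visible_context(tokens, position, causal):
    """List the tokens that the given position is allowed to look at."""
    row = attention_mask(len(tokens), causal)[position]
    return [token for token, allowed in zip(tokens, row) if allowed]


if __name__ == "__main__":
    tokens = ["the", "cat", "sat", "down"]
    print("decoder-style:", visible_context(tokens, 1, causal=True))
    print("encoder-style:", visible_context(tokens, 1, causal=False))
```

With the causal mask, the second token sees only itself and the token to its left, which is what makes next-token generation well defined. With the bidirectional mask it sees the whole sentence, which suits tasks that read a complete input and classify or extract from it.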

What Remained

As of 2026, BERT's paper has been cited over 100,000 times—among the most-cited machine-learning papers ever.

Technically, the rise of generative models after ChatGPT (November 2022) reduced BERT's visible footprint. But the deeper premises of modern AI—pretrain-then-fine-tune, Transformer-based general language models, open-weights as the engine that accelerates a whole community—all run on rails BERT laid in 2018.

Four years before ChatGPT astonished the public, the NLP research field had already switched paradigms. BERT was that switch.

Sources

  1. Secondary: BERT (language model), Wikipedia. Accessed 2026-05-24
