What Is Natural Language Processing? From Tokenization to Transformers

What Is Natural Language Processing?

Natural language processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics, focused on enabling computers to understand, interpret, generate, and manipulate human language in all its complexity. Human language — full of ambiguity, context-dependence, implicit meaning, metaphor, and cultural reference — presents challenges that go far beyond simple pattern matching. NLP bridges the gap between the unstructured, messy richness of human communication and the structured, precise operations of computers.

NLP encompasses an enormous range of tasks: translating between languages, summarizing long documents, answering questions about text, classifying sentiment in reviews, extracting structured information (names, dates, relationships) from unstructured text, generating coherent and contextually appropriate text, transcribing and understanding spoken language, and powering conversational agents. Each of these tasks is non-trivial, and approaching human-level performance on them has taken decades of research.

Tokenization: Breaking Text into Units

Tokenization is the first step in almost every NLP pipeline: splitting raw text into discrete units (tokens) that the model can process. The choice of tokenization strategy significantly affects model performance.

Word tokenization splits text into words (splitting on whitespace and punctuation). This is intuitive but has several problems: it creates huge vocabularies (hundreds of thousands of word types in a typical English corpus), fails to handle out-of-vocabulary words (rare words, names, technical terms not seen during training), and struggles with morphologically rich languages where the same root word takes many inflected forms.

Character tokenization treats each individual character as a token. This produces a tiny vocabulary (roughly 100 characters for English) and handles any word, but sequences become very long and character-level models must learn word structure from scratch.
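To make the trade-off between the two schemes concrete, here is a minimal sketch using only Python's standard library. The function names and the regular expression are illustrative assumptions, not the rules of any particular tokenizer.

```python
import re

def word_tokenize(text):
    # Runs of word characters become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every character, including spaces, becomes a token.
    return list(text)

sentence = "Tokenizers aren't magic."
words = word_tokenize(sentence)
chars = char_tokenize(sentence)

print(words)
print(len(words), "word-level tokens vs", len(chars), "character-level tokens")
```

Even on this short sentence the character-level sequence is several times longer, while the word-level split already has to make an arbitrary decision about the apostrophe in "aren't".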

Subword tokenization, the dominant approach in modern NLP, strikes a balance by keeping common words as single tokens and splitting rare words into subword units. The Byte Pair Encoding (BPE) algorithm, used in GPT models, starts from individual characters (or bytes) and iteratively merges the most frequent pair of adjacent tokens, building a vocabulary of commonly occurring character sequences. WordPiece (used in BERT) takes a similar approach, and SentencePiece is a toolkit that applies BPE or a unigram language model directly to raw text. Subword tokenization handles rare words and multiple languages gracefully while keeping sequences manageable. Vocabulary sizes for large language models typically range from about 30,000 to 100,000 subword tokens, and some recent models use more.
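The merge loop at the heart of BPE can be sketched in a few lines. This is a simplified, character-level illustration under stated assumptions: the toy word counts are made up, there is no end-of-word marker, and production tokenizers (such as the byte-level BPE used by GPT models) add details not shown here.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    # Start with each word split into single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

# Hypothetical word frequencies from a tiny corpus.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)
print(segmented)
```

After a few merges, the frequent suffix "est" becomes a single token shared by "newest" and "widest", which is the behavior that lets subword vocabularies cover rare words with a small number of reusable pieces.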

Word Embeddings: Representing Meaning as Vectors

Traditional NLP represented words as discrete symbols with no inherent relationship to each other — "cat" and "feline" were as different as "cat" and "democracy" from the model's perspective. Word embeddings — dense vector representations of words in a continuous high-dimensional space — capture semantic relationships by placing similar words near each other in the vector space.

The breakthrough came with Word2Vec (Mikolov et al., 2013), which learned embeddings by training a neural network to predict a word from its context (or vice versa) on a massive text corpus. The resulting vectors had remarkable algebraic properties: vector("king") - vector("man") + vector("woman") ≈ vector("queen"). Words used in similar contexts had similar vectors — the distributional hypothesis formalized in these learned representations.
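The analogy arithmetic can be illustrated with a toy example. The three-dimensional vectors below are hand-picked for illustration and are not real Word2Vec output; learned embeddings typically have hundreds of dimensions with no such clean interpretation.

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors; dimensions loosely mean (royalty, male, female).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def analogy(a, b, c, vectors):
    # Compute vector(a) - vector(b) + vector(c), then find the nearest remaining word.
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

Excluding the three input words from the candidates mirrors how such analogies are usually evaluated; without that exclusion, the nearest vector is often one of the inputs.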

GloVe (Global Vectors for Word Representation; Pennington et al., 2014) learned embeddings from corpus-wide word co-occurrence counts, rather than by predicting words one local context window at a time. Both Word2Vec and GloVe produced static embeddings: each word had one fixed vector regardless of context. This failed to handle polysemy: "bank" (financial institution) and "bank" (river bank) got the same vector even though the two senses have completely different meanings.
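The co-occurrence statistics that count-based methods start from can be sketched as follows. The corpus, window size, and function name are illustrative assumptions; this shows only the counting step, not the GloVe training objective.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Pair each word with neighbors up to `window` positions to its right,
            # recording the pair in both directions so the counts are symmetric.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts

# Hypothetical three-sentence corpus mixing two senses of "bank".
corpus = [
    "she sat on the river bank",
    "he opened a bank account",
    "the bank approved the loan",
]
counts = cooccurrence_counts(corpus)
print(counts[("river", "bank")], counts[("bank", "account")])
```

Note that "bank" accumulates neighbors from both its river sense and its financial sense in a single row of counts, which is exactly why a static embedding derived from those counts cannot separate the two meanings.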

Modern language models use contextual embeddings: each word's representation depends on the words around it. The same word gets different vector representations in different sentences, capturing the word's meaning in context rather than an average across all uses. This shift, enabled by transformer architectures, has been central to the dramatic progress in NLP performance.

The Transformer Architecture

The transformer, introduced in the landmark paper "Attention Is All You Need" (Vaswani et al., 2017), revolutionized NLP and has since transformed computer vision, biology, and other fields. It replaced recurrent architectures (which processed sequences step by step) with a purely attention-based approach that processes the entire sequence simultaneously.

The core mechanism is self-attention: for each token in the input sequence, the model computes how much "attention" to pay to every token in the sequence (including itself) when representing that token. Formally, each token is projected into three vectors, a query (Q), a key (K), and a value (V), and attention weights are computed by taking the dot product of the token's query with every key, scaling by the square root of the key dimension, and applying a softmax so the weights sum to 1. The output representation for each token is the sum of the value vectors, weighted by these attention weights. This allows every token to directly "see" and be influenced by every other token in the sequence, capturing long-range dependencies that RNNs struggled with.
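The computation described above, often written as softmax(QK^T / sqrt(d_k)) V, can be sketched in plain Python. The Q, K, and V values below are made up for illustration; in a real transformer they come from learned linear projections of the token embeddings, and the arithmetic is done with batched matrix operations.

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Scaled dot product of this token's query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical projections for a three-token sequence.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = self_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights is a probability distribution over the sequence, so every output vector is a blend of all value vectors regardless of how far apart the tokens are.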

Multi-head attention runs multiple attention computations in parallel (each with different learned projections), allowing the model to simultaneously attend to different aspects of the context (syntactic relationships, semantic content, coreference) and combine them. The transformer also uses position encodings, because self-attention by itself is insensitive to token order, and feed-forward layers applied to each position independently.
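One concrete form of position encoding is the fixed sinusoidal scheme from the original transformer paper, sketched below. This is only one option (many later models use learned or relative position representations instead), and the function name and sizes here are illustrative.

```python
import math

def sinusoidal_position_encoding(num_positions, d_model):
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency: 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_position_encoding(num_positions=4, d_model=8)
for row in pe:
    print([round(x, 3) for x in row])
```

Each row is added to the embedding of the token at that position, giving otherwise identical tokens a distinct representation depending on where they appear.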

Transformers can be scaled to enormous sizes by increasing the number of layers, attention heads, and embedding dimensions. This scaling, combined with large training datasets, produces qualitative improvements in capabilities — an empirical finding that has driven the large language model revolution.

BERT and GPT: Two Paradigms

Two transformer-based architectures have defined the modern era of NLP, representing complementary approaches:

BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2018) is an encoder-only transformer trained with two self-supervised objectives on unlabeled text: Masked Language Modeling (MLM — randomly masking 15% of input tokens and predicting them from context, attending bidirectionally to all other tokens) and Next Sentence Prediction (NSP — predicting whether two sentences are consecutive in the original text). BERT produces contextual representations of input text that can be fine-tuned for downstream tasks (classification, named entity recognition, question answering) by adding a small task-specific output layer and training on labeled examples. BERT dramatically improved performance on a wide range of NLP benchmarks when released.
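The masking step of MLM can be sketched as below. This is a simplified illustration: the function name, seed, and example sentence are assumptions, and the original BERT recipe additionally replaces some selected tokens with random tokens or leaves them unchanged rather than always inserting the mask symbol.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    rng = random.Random(seed)
    # Select roughly 15% of positions (at least one) as prediction targets.
    num_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # the token the model must recover
        masked[pos] = mask_token
    return masked, labels

tokens = "the river bank was flooded after days of heavy rain in spring".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

Because the model sees the tokens on both sides of each masked position, the training signal encourages representations that use left and right context together.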

GPT (Generative Pre-trained Transformer, OpenAI) is a decoder-only transformer trained with a simple next-token prediction objective: given a sequence of tokens, predict the next token. This is trained in a causal (left-to-right) manner — each token can only attend to previous tokens. GPT-1 (2018), GPT-2 (2019), GPT-3 (2020, 175 billion parameters), and GPT-4 (2023) demonstrated that scaling decoder-only transformers on vast text corpora produces models capable of few-shot and zero-shot generalization to a remarkable range of tasks — without task-specific fine-tuning, simply by providing a few examples in the prompt.
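The next-token objective and the causal attention constraint can be sketched in a few lines. The helper names and the four-token example are illustrative assumptions.

```python
def next_token_examples(tokens):
    # Each training example pairs a prefix with the token that follows it.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def causal_mask(length):
    # mask[i][j] is True when position i may attend to position j (only j <= i).
    return [[j <= i for j in range(length)] for i in range(length)]

tokens = ["the", "cat", "sat", "down"]
examples = next_token_examples(tokens)
mask = causal_mask(len(tokens))

for prefix, target in examples:
    print(prefix, "->", target)
for row in mask:
    print(row)
```

In practice all prefixes of a sequence are trained in a single forward pass: the lower-triangular mask ensures that the prediction at each position cannot look at the tokens it is supposed to predict.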

Key NLP Tasks and Applications

Modern NLP systems tackle a wide range of tasks:

Text classification: Assigning predefined categories to text — sentiment analysis (positive/negative/neutral), topic classification, spam detection. BERT-style models fine-tuned on labeled examples achieve near-human performance on many classification benchmarks.
Machine translation: Translating text from one language to another. Transformer-based neural machine translation, built on the architecture introduced in 2017, dramatically improved translation quality, particularly for high-resource language pairs. Services like DeepL and Google Translate handle over 100 language pairs.
Text summarization: Producing shorter, coherent summaries of longer texts. Abstractive summarization (generating new sentences, not just extracting existing ones) was historically very difficult; modern LLMs produce high-quality abstractive summaries across diverse domains.
Question answering: Given a context document and a question, extracting or generating the correct answer. Reading comprehension benchmarks like SQuAD (Stanford Question Answering Dataset) showed BERT-style models achieving near-human performance on extractive QA. Open-domain QA, which means answering questions without a given context document by relying on retrieved passages or on knowledge from pre-training, is a current research frontier.
Named entity recognition (NER): Identifying and classifying named entities (people, organizations, locations, dates) in text. Essential for information extraction, knowledge base construction, and downstream applications.
Code generation: LLM-based tools like GitHub Copilot (originally powered by OpenAI Codex) assist programmers by generating, completing, and explaining code across dozens of programming languages.

Recent Breakthroughs and Future Directions

The period from 2020 to 2025 saw an extraordinary acceleration in NLP capabilities, driven by scaling laws — the empirical observation that model performance improves predictably with model size, dataset size, and compute. GPT-3's few-shot learning capabilities surprised even its creators; GPT-4 and its contemporaries (Claude, Gemini, LLaMA) exhibit sophisticated reasoning, multimodal understanding (combining text and images), and instruction-following abilities that represent a genuine qualitative advance.

Reinforcement Learning from Human Feedback (RLHF) has been crucial for aligning large language models with human preferences, making them more helpful, accurate, and safe. Human raters compare model outputs and provide preference data; a reward model is trained on these preferences; and the language model is fine-tuned via RL to maximize reward. This process, along with supervised fine-tuning on high-quality demonstrations, turned GPT-3-era base models into much more useful and aligned assistants such as ChatGPT.
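The reward-model step can be illustrated with a pairwise preference loss. The text above does not specify a loss function; the logistic (Bradley-Terry style) form below is a commonly used choice and is shown here as an assumption, with made-up reward values.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    # Negative log-probability that the human-preferred response wins,
    # where that probability is sigmoid(reward_chosen - reward_rejected).
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# Hypothetical reward-model scores for two responses to the same prompt.
loss_agree = pairwise_preference_loss(2.0, 0.5)     # model agrees with the rater
loss_tie = pairwise_preference_loss(1.0, 1.0)       # model cannot tell them apart
loss_disagree = pairwise_preference_loss(0.5, 2.0)  # model prefers the wrong one
print(round(loss_agree, 3), round(loss_tie, 3), round(loss_disagree, 3))
```

Minimizing this loss pushes the reward model to score preferred responses higher than rejected ones; the language model is then optimized against the resulting scores.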

Current research frontiers include multimodal models that reason over text, images, audio, and video simultaneously; retrieval-augmented generation (RAG) that combines LLMs with real-time information retrieval to reduce hallucination; efficient fine-tuning methods (LoRA, prefix tuning) that adapt large models to specific tasks with minimal compute; and constitutional AI and other alignment techniques for building reliably safe and honest systems. NLP has arguably made more progress in the past decade than in the preceding fifty years — yet the gap between current capabilities and genuine language understanding remains a topic of active debate.
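The retrieve-then-prompt pattern behind RAG can be sketched with a toy retriever. Everything here is an illustrative assumption: the word-overlap scoring stands in for the dense vector search that real systems typically use, the prompt template is made up, and the final call to a language model is omitted.

```python
import re

def tokenize(text):
    # Lowercased set of alphanumeric words.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    # Rank documents by how many distinct words they share with the question.
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents, top_k=1):
    # Place the retrieved passages ahead of the question so the model can ground its answer.
    context = "\n".join(retrieve(question, documents, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

documents = [
    "BERT is an encoder-only transformer trained with masked language modeling.",
    "GPT is a decoder-only transformer trained with next-token prediction.",
    "Byte Pair Encoding merges the most frequent pairs of adjacent tokens.",
]
prompt = build_prompt("Which objective is GPT trained with?", documents)
print(prompt)
```

Because the relevant passage is placed in the prompt, the model can answer from the supplied text rather than from memory alone, which is the mechanism by which retrieval is meant to reduce hallucination.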
