How AI Understands Text: Natural Language Processing for Beginners (Part 6)

AI Fundamentals Series · Part 6 of 10 — Previous: Part 5: Neural Networks for Beginners — Next: Part 7: Computer Vision for Beginners

The Language Problem

Of all the things computers struggle with, language is among the most challenging. Human language is ambiguous, context-dependent, culturally loaded, and constantly evolving. The word “bank” means different things next to “river” and “money.” The sentence “I saw the man with the telescope” has two valid grammatical interpretations. Sarcasm, idiom, and metaphor require cultural context that no dictionary can fully capture.

In Part 5, we established that neural networks operate on numbers. Language, however, is made of symbols — words, characters, punctuation. Natural Language Processing (NLP) is the field devoted to bridging this gap: converting language into numerical representations that neural networks can process, and converting network outputs back into human-readable text.

This article will walk you through the key ideas, from the basics of tokenization to the large-scale language models that power today's AI assistants.

Step 1: Tokenization — Splitting Text into Pieces

Before any computation can happen, text must be broken into manageable units called tokens. Tokenization is the process of splitting a string of text into these units.

You might assume words are the natural tokens, but modern NLP systems often split text into smaller units, an approach called subword tokenization. Here is why word-level tokens fall short:

A word-level approach would need a separate entry for every word in every language — an enormous, unmanageable vocabulary.
It struggles with rare words and words the model has never seen before (out-of-vocabulary words).
Languages like German that compound words freely (e.g., Donaudampfschifffahrtsgesellschaft) would produce impractically large vocabularies.

Subword tokenization splits text into pieces that are smaller than words but larger than individual characters. One of the most widely used algorithms is Byte Pair Encoding (BPE). It builds a vocabulary by starting with individual characters and then iteratively merging the most frequent pair of adjacent units until the vocabulary reaches a target size. The result is often a vocabulary of roughly 30,000 to 100,000 tokens (some recent models use more) that keeps most common words whole while breaking rare words into recognizable parts.
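The merge loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production tokenizer: the function name learn_bpe_merges and the tiny word list are made up for this example, and real implementations add details such as end-of-word markers and byte-level handling.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy version)."""
    # Each distinct word starts as a tuple of single characters, with its count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Choose the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical mini-corpus: each word repeated to simulate frequency.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, corpus = learn_bpe_merges(words, num_merges=3)
print(merges)
```

With this corpus the frequent ending "est" is assembled within the first two merges, which is the behavior the paragraph describes: common character sequences become single vocabulary entries.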

For example, the word “unhappiness” might be tokenized as [“un”, “happiness”] or [“un”, “happi”, “ness”], depending on the vocabulary; in every case the pieces concatenate back to the original word. Each subword piece is then mapped to a unique integer ID, producing a sequence of numbers that the neural network can process.
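To make the step from text to integer IDs concrete, here is a minimal sketch. It assumes a hand-written vocabulary and uses greedy longest-match segmentation, which is a simplification chosen for readability (a real BPE tokenizer applies its learned merges in order instead). The names segment and vocab, and the "<unk>" fallback token, are hypothetical.

```python
def segment(word, vocab):
    """Greedy longest-match segmentation of a word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until something matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing matched at this position: emit the unknown token
            # for one character and move on.
            pieces.append("<unk>")
            start += 1
    return pieces

# Hypothetical toy vocabulary mapping each piece to its integer ID.
vocab = {"<unk>": 0, "un": 1, "happi": 2, "ness": 3, "happy": 4}

pieces = segment("unhappiness", vocab)
ids = [vocab[p] for p in pieces]
print(pieces, ids)
```

The list of IDs, not the strings, is what gets passed to the network.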

Step 2: Word Embeddings — Giving Words Meaning in Numbers

A token ID is just a number in a list — it carries no information about the word's meaning or its relationship to other words. The integer 342 (for “king”) has no inherent relationship to 891 (for “queen”) unless we give it one.

Word embeddings solve this problem by representing each token not as a single integer but as a vector — a list of hundreds or thousands of floating-point numbers. Think of each number in the vector as a coordinate in a high-dimensional space. Words that are used in similar contexts end up close together in this space; words used in different contexts end up far apart.

The famous example, demonstrated with early embedding models like Word2Vec, is that these vector representations capture meaningful relationships:

vector(“king”) − vector(“man”) + vector(“woman”) ≈ vector(“queen”)

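This arithmetic suggests that the model has learned, purely from patterns in text and without being told explicitly, that “king” and “queen” are related by gender in the same way that “man” and “woman” are. The relationship is approximate: in practice, “queen” is the nearest word vector to the result once the input words themselves are excluded.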
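The analogy arithmetic can be tried out on toy data. The three-dimensional vectors below are hand-made for illustration (real embeddings have hundreds of dimensions and are learned, not written by hand), and the helper name cosine is an assumption of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors; think of the axes as [royalty, maleness, femaleness].
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

# vector("king") - vector("man") + vector("woman")
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Search the remaining words for the closest vector, skipping the inputs.
candidates = {w: v for w, v in vectors.items()
              if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)
```

Cosine similarity is used here because it compares direction rather than length, which is a common choice for comparing embeddings.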

In modern language models, embeddings are not fixed in advance but are part of the model's learned parameters. The embedding for each token is updated during training along with all the other weights.
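One way to picture a learned embedding is as a table with one row per token ID, where training nudges the rows. The sketch below is a simplified illustration: the table size, the random initialization, and the gradient passed to sgd_step are all made-up values, and a real model would compute the gradient by backpropagation.

```python
import random

rng = random.Random(0)          # fixed seed so the example is repeatable
VOCAB_SIZE, DIM = 5, 4          # hypothetical sizes, far smaller than real models

# One row of DIM numbers per token ID, initialized to small random values.
embedding_table = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[i] for i in token_ids]

def sgd_step(token_id, gradient, learning_rate=0.1):
    """Move one embedding row a small step against its gradient."""
    row = embedding_table[token_id]
    for j, g in enumerate(gradient):
        row[j] -= learning_rate * g

before = list(embedding_table[2])
sgd_step(2, [1.0, 0.0, 0.0, 0.0])   # made-up gradient for illustration
after = embedding_table[2]
print(before[0], "->", after[0])
```

Only the row for the token that appeared in the training example changes, which is why frequently seen tokens tend to end up with better-tuned embeddings.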

Step 3: Understanding Context — The Transformer Architecture

Earlier neural NLP models, such as recurrent neural networks (RNNs), had a major limitation: they processed text sequentially, one word at a time, and had difficulty “remembering” information from many words back. Long-range dependencies in language, such as a pronoun that refers to a noun mentioned twenty sentences earlier, were difficult to capture.

The Transformer architecture, introduced in 2017, solved this with a mechanism called self-attention. Self-attention allows every token in a sequence to directly attend to (look at and be influenced by) every other token in the sequence simultaneously, regardless of how far apart they are.

Here is an intuition for self-attention. Consider the sentence: “The animal didn't cross the street because it was too tired.” What does “it” refer to? The animal, not the street. A self-attention mechanism lets the model learn to assign high attention weight to “animal” when processing “it,” effectively learning to resolve pronoun references.

The same computation runs for every token in every sequence the model sees during training, so the parameters that produce the attention weights gradually come to encode a vast number of contextual associations. The result is a model that represents each word in light of the words around it.
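The core computation can be sketched as scaled dot-product attention. This is a deliberately stripped-down version: it uses the token vectors directly as queries, keys, and values, whereas a real Transformer first passes them through learned projection matrices and uses multiple attention heads. The two-dimensional vectors are hand-picked so that “it” lies close to “animal”.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no projections)."""
    d = len(X[0])
    outputs, all_weights = [], []
    for query in X:
        # Similarity of this token to every token, scaled by sqrt(dimension).
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                  for key in X]
        weights = softmax(scores)
        # New representation: attention-weighted average of all token vectors.
        output = [sum(w * value[j] for w, value in zip(weights, X))
                  for j in range(d)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["animal", "street", "it"]
X = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]   # hand-picked toy vectors
outputs, weights = self_attention(X)
for token, w in zip(tokens, weights[2]):
    print(f"'it' attends to {token!r} with weight {w:.2f}")
```

In this toy setup “it” places more weight on “animal” than on “street”. In a trained model that preference is not hand-coded but emerges from the learned projections.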

Real-World NLP Applications

NLP is one of the most commercially deployed areas of AI. Here are the main application categories:

Sentiment Analysis

Classifying the emotional tone of text — positive, negative, or neutral — is one of NLP's most widely used tasks. Businesses use it to automatically monitor product reviews, customer support tickets, and social media mentions at scale, without reading every message manually.
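To show the shape of the task (text in, label out), here is a deliberately simple word-list sketch. It is not how modern systems work: production sentiment models are trained classifiers, and the word lists and the function name classify_sentiment are invented for this example.

```python
# Hypothetical hand-written word lists; a trained model learns these cues instead.
POSITIVE = {"great", "love", "excellent", "good", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "broken"}

def classify_sentiment(text):
    """Label text by counting positive and negative words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("I love this phone, the camera is great!"))
print(classify_sentiment("The screen arrived broken and support was terrible."))
print(classify_sentiment("The package arrived on Tuesday."))
```

A word-count approach like this misses negation and sarcasm (“not good”, “great, another delay”), which is one reason context-aware neural models took over this task.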

Machine Translation

Neural machine translation, which began replacing older phrase-based methods in production systems around 2016, produces markedly more fluent translations by translating entire sentences with contextual awareness. Services such as Google Translate and DeepL operate at enormous scale; Google reported in 2016 that Google Translate alone was translating more than 100 billion words per day.

Text Summarization

NLP models can condense long documents into shorter summaries. Abstractive summarization goes further than extractive methods — rather than simply copying key sentences, it generates new text that captures the meaning in different words.
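The extractive side of that contrast is easy to sketch: score each sentence and copy the best ones verbatim. The version below uses average word frequency as the score, which is a classic baseline chosen here for simplicity and not the method any particular product uses. The function name extractive_summary and the sample text are made up. Abstractive summarization has no comparably short sketch, since it requires a generative model.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Pick the sentences whose words are most frequent in the document."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Return the chosen sentences in their original order.
    return [s for s in sentences if s in ranked]

text = ("Transformers use attention. "
        "Attention lets transformers relate distant words. "
        "The weather was nice.")
summary = extractive_summary(text, num_sentences=1)
print(summary)
```

Every sentence in the output appears word for word in the input, which is exactly the limitation abstractive methods are meant to overcome.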

Question Answering

Given a passage of text and a question, NLP models can extract or generate the answer. This underlies search engines that display direct answer boxes, virtual assistants, and enterprise document search tools.

Text Generation

Large language models generate coherent, contextually appropriate text for a prompt. Applications range from code completion (GitHub Copilot) to creative writing to document drafting to customer service chatbots.

Named Entity Recognition

Identifying and classifying specific entities in text — people, places, organizations, dates, amounts — is essential for information extraction, journalism tools, and business intelligence applications.

How Large Language Models Actually Generate Text

Understanding this mechanism will be crucial in Part 8, but let's introduce it here. A Large Language Model (LLM) like GPT is fundamentally a next-token predictor. During training, it was shown enormous quantities of text and trained to predict: given this sequence of tokens, what token comes next?
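The training setup can be illustrated by turning one token sequence into its prediction tasks. This is a sketch of the idea only; the function name next_token_examples and the token IDs are hypothetical, and real training processes all positions of a sequence in parallel instead of building explicit pairs.

```python
def next_token_examples(token_ids):
    """Build (context, target) pairs: each prefix predicts the token after it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

# Hypothetical token IDs for a four-token sentence.
examples = next_token_examples([5, 9, 2, 7])
for context, target in examples:
    print(context, "->", target)
```

A sequence of n tokens yields n - 1 training examples, so even a single document provides many prediction tasks.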

At inference time, generating text works like this:

The model receives a prompt (a sequence of tokens).
It runs a forward pass and produces a probability distribution over all tokens in its vocabulary: “Token A has a 15% chance of being next, token B has 8%, token C has 0.001%...”
A token is sampled from this distribution (or selected by taking the most probable one).
That new token is appended to the sequence, and the process repeats.

This is called autoregressive generation — each step's output becomes part of the next step's input. Generate enough tokens this way, and you get paragraphs, essays, code, or entire conversations.
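The generation loop above can be sketched with a stand-in for the model. The probability table below is invented and looks only at the last token, whereas a real LLM computes its distribution from the entire sequence with a forward pass; the names BIGRAM_PROBS and generate are assumptions of this example.

```python
import random

# Hypothetical stand-in for a trained model: next-token probabilities
# that depend only on the most recent token.
BIGRAM_PROBS = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.5, "<end>": 0.5},
    "sat": {"<end>": 1.0},
}

def generate(prompt, max_new_tokens=10, greedy=True, rng=None):
    """Autoregressive generation: predict, append, repeat."""
    rng = rng or random.Random(0)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        distribution = BIGRAM_PROBS[tokens[-1]]        # the "forward pass"
        if greedy:
            # Take the single most probable token.
            next_token = max(distribution, key=distribution.get)
        else:
            # Sample in proportion to the probabilities.
            next_token = rng.choices(list(distribution),
                                     weights=list(distribution.values()), k=1)[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)                      # output becomes input
    return tokens

print(generate(["<start>"]))
print(generate(["<start>"], greedy=False))
```

Greedy selection always returns the same continuation for a given prompt, while sampling can return different ones, which is why the same prompt can produce varied responses.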

The apparent “understanding” these models display emerges from training on an extraordinary breadth of human text. The model has internalized patterns of reasoning, factual associations, stylistic conventions, and contextual cues from a very large body of written material.

The Transformer's Role in Multimodal AI

The Transformer architecture, originally designed for text, has proven remarkably versatile. Researchers have adapted it to process images (Vision Transformers, covered in Part 7), audio (speech recognition and generation), video, and combinations of these modalities. Models that work with more than one modality, for example accepting both text and images as input and in some cases generating both as output, are called multimodal models.

Multimodal capability means a user can show an AI a photograph of a meal and ask for a recipe, or describe a scene in words and receive a generated image, or upload a chart and ask the model to interpret the trends. The underlying mechanism is the same Transformer architecture, extended to treat image patches or audio spectrograms as tokens alongside text tokens. This unification of modalities under a single architecture is one of the most consequential developments in recent AI research.
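Treating image patches as tokens can be sketched as follows. The example assumes a tiny grayscale image stored as a nested list and a hypothetical helper named image_to_patches; a real Vision Transformer would additionally project each flattened patch to an embedding vector and add position information.

```python
def image_to_patches(image, patch_size):
    """Split a 2-D grid of pixel values into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size square, row by row.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A 4x4 toy "image" whose pixel values are simply 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, patch_size=2)
print(patches)
```

Each flattened patch plays the same role for an image that a token plays for text: one item in the sequence the Transformer attends over.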

Practical Applications of NLP in Business and Society

NLP is already deployed at massive scale in ways that shape daily life, even when users do not realize it:

Search engines: Google's 2019 deployment of BERT for search marked a significant improvement in understanding query intent, particularly for natural-language queries rather than keyword-style searches. Google estimated at launch that the change would affect about one in ten English-language searches in the U.S.
Customer service: large organizations use NLP-powered chatbots and ticket classification systems to route and respond to customer inquiries. When implemented well, these systems reduce costs while maintaining acceptable response quality for common questions.
Legal and compliance: NLP tools extract clauses from contracts, identify regulatory risk in documents, and flag potential compliance issues at a speed and scale no human team could match.
Healthcare documentation: speech recognition and summarization models transcribe and structure physician notes, reducing documentation burden so clinicians can spend more time with patients.
Language learning: apps like Duolingo use NLP models to evaluate open-ended written and spoken responses, provide targeted feedback, and personalize the learning path.
Content moderation: platforms use NLP classifiers to detect hate speech, spam, misinformation, and policy violations at the scale of billions of posts per day — a task no human workforce could accomplish.

Limitations of Current NLP Systems

Despite impressive capabilities, NLP systems have real limitations:

Hallucinations: models sometimes generate plausible-sounding but factually incorrect information. Because they are statistical pattern-matchers rather than knowledge databases, they cannot always distinguish what they know from what they are guessing.
Context window limits: models can only attend to a finite amount of text at once (their context window). Information outside the window is inaccessible.
Lack of true understanding: models perform impressively on many tasks that seem to require understanding but may fail on novel problems that require genuine reasoning rather than pattern completion.
Language imbalance: models trained predominantly on English-language data perform better in English than in lower-resource languages.
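The context window limit listed above is the easiest of these to show in code. One simple policy, assumed here for illustration, is to keep only the most recent tokens; the function name fit_to_context_window is hypothetical, and real applications may instead summarize or retrieve older material.

```python
def fit_to_context_window(token_ids, max_tokens):
    """Keep only the most recent max_tokens tokens; earlier ones are dropped."""
    if max_tokens <= 0:
        return []
    return token_ids[-max_tokens:]

conversation = list(range(10))          # stand-in for ten token IDs
visible = fit_to_context_window(conversation, max_tokens=4)
print(visible)
```

Anything that falls outside the returned slice never reaches the model, which is why long conversations can appear to "forget" their beginning.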

Key Takeaways

NLP converts text into numbers (tokenization + embeddings) and then processes those numbers with neural networks.
Tokens are the basic units of text processing, often subword fragments rather than whole words.
Word embeddings represent meaning as vectors, placing semantically related words near each other in a high-dimensional space.
The Transformer architecture uses self-attention to understand relationships between words regardless of distance.
LLMs generate text by repeatedly predicting the next token, one step at a time.
NLP limitations include hallucinations, context window constraints, and imperfect reasoning.

In Part 7, we shift from language to vision — exploring how AI systems learn to interpret pixels as objects, scenes, and meaning.