Generative AI Explained: How ChatGPT and Image Generators Work (Part 8)

AI Fundamentals Series · Part 8 of 10 — Previous: Part 7: Computer Vision for Beginners — Next: Part 9: AI Ethics and Risks

What Does “Generative” Mean?

Most of the AI we have discussed so far has been discriminative: given an input, it outputs a label or a decision. A spam filter says “spam” or “not spam.” An image classifier says “cat” or “dog.” A fraud detector says “legitimate” or “suspicious.” These systems consume information and produce a judgment.

Generative AI is different. Instead of classifying inputs, it produces new outputs that resemble the data it was trained on. Give it a text prompt and it writes an essay. Give it a description and it paints a picture. Give it a melody and it composes harmonizing accompaniments. The AI is not looking something up or retrieving stored content — it is synthesizing something new from patterns it has internalized.

This is why generative AI has felt so dramatically different from previous AI products. Earlier AI was useful but invisible infrastructure. Generative AI creates tangible artifacts — text, images, audio, video, code — that feel like the products of human creativity.

Large Language Models: Next-Token Prediction at Scale

We introduced large language models (LLMs) in Part 6. Here we go deeper into how they generate convincing text.

At its core, an LLM like GPT-4 or Claude is first trained on one extraordinarily simple task: given the tokens that came before, predict the next token. That's it. At this stage there are no explicit objectives like “be helpful” or “write well.” Just: what comes next?

The magic is in the scale and breadth of training. The model trains on hundreds of billions to trillions of tokens drawn from the internet, books, scientific papers, code repositories, and more. To predict the next token accurately across all that diverse text, the model must develop rich internal representations of:

Grammatical structure
Factual associations (the capital of France is Paris)
Reasoning patterns (if A > B and B > C, then A > C)
Writing styles and conventions
Social and cultural context
Cause-and-effect relationships

None of this was taught explicitly. It all emerged from the pressure of predicting the next token accurately, billions of times over.
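To make the next-token objective concrete, here is a minimal sketch of a far simpler predictor: a bigram model that counts which token tends to follow which. It is only an analogy. Real LLMs use neural networks that look at long contexts, and the tiny corpus and function names below are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the training text."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent follower of `prev`, or None if it was never seen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# A made-up toy corpus; a real model trains on vastly more text.
corpus = "the cat sat on the mat . the cat ate the fish .".split()
model = train_bigram(corpus)

print(predict_next(model, "the"))  # "cat" follows "the" most often in this corpus
```

Even this toy picks up a regularity of its training text without being told any rules. The claim in this section is that the same pressure, applied at enormous scale with a far more capable model, yields much richer representations.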

From Prediction to Conversation: RLHF

A raw next-token predictor would continue any text it is given — not necessarily helpfully or safely. To transform this into a useful assistant, models undergo additional training called Reinforcement Learning from Human Feedback (RLHF):

Human raters evaluate many model responses and rank them by quality, helpfulness, and safety.
A separate model (a “reward model”) learns to predict which responses human raters prefer.
The LLM is fine-tuned using reinforcement learning to generate responses that the reward model scores highly.

This process shapes the model's behavior: it learns to be conversational, avoid harmful content, acknowledge uncertainty, and follow instructions. RLHF is the difference between a raw text predictor and a well-behaved AI assistant.
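The reward-model step can be sketched with a little arithmetic. A common formulation, assumed here because the article does not name a specific loss, is a pairwise preference loss: the reward model is penalized whenever it scores the response that raters rejected above the one they preferred. The function names and reward values below are hypothetical.

```python
import math


def preference_probability(reward_preferred, reward_rejected):
    """Probability that the preferred response wins, given two reward scores.

    This is a logistic function of the score difference (Bradley-Terry style).
    """
    return 1.0 / (1.0 + math.exp(-(reward_preferred - reward_rejected)))


def pairwise_loss(reward_preferred, reward_rejected):
    """Loss is small when the preferred response gets the higher score."""
    return -math.log(preference_probability(reward_preferred, reward_rejected))


# Hypothetical reward scores for two responses to the same prompt.
print(pairwise_loss(2.0, 0.0))  # reward model agrees with the raters: low loss
print(pairwise_loss(0.0, 2.0))  # reward model disagrees with the raters: high loss
```

Training the reward model means adjusting it to lower this loss across many rated pairs. The reinforcement learning stage then pushes the LLM toward responses that the trained reward model scores highly.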

How Image Generators Work: Diffusion Models

Most of today's AI image generators, including Stable Diffusion, DALL-E, and Midjourney, are built around a different kind of model called a diffusion model. Understanding the intuition requires visualizing two processes: a forward process and a reverse process.

The Forward Process (Training)

During training, real images are progressively degraded by adding random noise, step by step, until nothing is left but static that is indistinguishable from random noise. The model is trained to undo this process: given a noisy image, it learns to estimate the noise that was added, which amounts to predicting what the image would look like with a little less noise.
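The noising step can be sketched in a few lines. The version below assumes one common formulation, in which the image is blended with Gaussian noise according to a "signal fraction". The four-pixel image, the chosen fractions, and the function name are all made up for illustration.

```python
import math
import random


def add_noise(pixels, signal_fraction, rng):
    """Blend an image with Gaussian noise.

    signal_fraction = 1.0 keeps the image intact; 0.0 leaves pure noise.
    Returns the noisy image and the noise that was mixed in.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [
        math.sqrt(signal_fraction) * p + math.sqrt(1.0 - signal_fraction) * n
        for p, n in zip(pixels, noise)
    ]
    return noisy, noise


rng = random.Random(0)
image = [0.0, 0.5, 1.0, 0.5]  # a made-up 4-pixel "image"

# Walk from a clean image toward pure static.
for signal_fraction in (1.0, 0.9, 0.5, 0.1, 0.0):
    noisy, _ = add_noise(image, signal_fraction, rng)
    print(signal_fraction, [round(v, 2) for v in noisy])
```

Because the function also returns the noise it mixed in, each call yields a training pair: the noisy image is the input, and the added noise is the target a denoising model would learn to estimate.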

The Reverse Process (Generation)

To generate a new image, the model starts with pure random noise (just static) and applies the denoising step over and over until a coherent image emerges. Think of a sculptor chipping away at a block of marble: each step reveals a little more of the underlying form.

To control what image emerges, the model is also conditioned on a text description. During training, images were paired with their captions, and the model learned to denoise in a direction that matches the description. At generation time, you provide a text prompt, and the denoising process is guided toward images consistent with that description.

This is why typing “a golden retriever in a space suit” produces a coherent image rather than random output — the model has learned the relationship between those words and visual content during training.
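The shape of the generation loop can be sketched as follows. This is a deliberately crude stand-in, not a real diffusion sampler: a trained model is a neural network that estimates noise, whereas the hypothetical `toy_denoise_step` below simply looks up a target for each prompt. What the sketch does preserve is the structure described above: start from noise, take many small denoising steps, and let the prompt steer where they lead.

```python
import random

# Hypothetical stand-in for what a trained, text-conditioned model has learned.
# A real model stores no such table; this only mimics its role in the loop.
TARGETS = {
    "bright": [1.0, 1.0, 1.0, 1.0],
    "dark": [0.0, 0.0, 0.0, 0.0],
}


def toy_denoise_step(pixels, prompt, strength=0.2):
    """Move each pixel a small fraction of the way toward the prompt's target."""
    target = TARGETS[prompt]
    return [p + strength * (t - p) for p, t in zip(pixels, target)]


def generate(prompt, steps=30, seed=0):
    """Start from pure noise and repeatedly apply the denoising step."""
    rng = random.Random(seed)
    pixels = [rng.gauss(0.0, 1.0) for _ in range(4)]
    for _ in range(steps):
        pixels = toy_denoise_step(pixels, prompt)
    return pixels


print([round(p, 2) for p in generate("bright")])
print([round(p, 2) for p in generate("dark")])
```

Both calls start from the same random static (the seed is fixed), yet they end in different images because the prompt changes the direction of every step.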

Prompting: Talking to a Generative AI

The text you provide to a generative AI — called a prompt — strongly shapes what you get back. Learning to write effective prompts is now considered a practical skill for working with these systems. A few principles make a significant difference:

Be Specific

Vague prompts produce generic results. Instead of “write me an email,” try “write a polite but firm email to a supplier who has missed two delivery deadlines, requesting a revised schedule within three business days.” The more context and constraints you provide, the more useful the output.

Specify Format and Tone

Tell the model what structure you want: “write this as a bulleted list,” “write at a 10-year-old's reading level,” or “respond in the style of a formal legal letter.” LLMs are highly responsive to explicit stylistic instructions.

Provide Examples

Showing the model examples of what you want, a technique called few-shot prompting, often improves output quality for structured tasks. Include two or three (input, desired output) pairs before your actual question.
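As a rough sketch, a few-shot prompt can be assembled with a small helper like the one below. The "Input:" and "Output:" labels, the function name, and the sample reviews are assumptions made for illustration; no particular model or API requires this exact layout.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a prompt from (input, desired output) pairs plus the real query."""
    lines = []
    if instruction:
        lines.append(instruction)
        lines.append("")
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
        lines.append("")
    # End with the real query and an empty output slot for the model to fill.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Made-up examples for a sentiment-labeling task.
examples = [
    ("The delivery arrived two days early.", "positive"),
    ("The box was crushed and the item was broken.", "negative"),
]
prompt = build_few_shot_prompt(
    examples,
    "Setup took five minutes and everything worked.",
    instruction="Label each customer review as positive or negative.",
)
print(prompt)
```

The resulting string is sent to the model as an ordinary prompt. The examples show the pattern, and the trailing "Output:" invites the model to continue it.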

Ask for Reasoning

Adding the instruction “think step by step” or “explain your reasoning before giving the answer” often improves accuracy on complex tasks. This technique is called chain-of-thought prompting, and it helps the model organize its computation into intermediate steps rather than jumping directly to an answer.
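One practical wrinkle is that a step-by-step response mixes reasoning with the final result. A simple convention, assumed here rather than required by any model, is to ask for the result on a line starting with "Answer:" and then pull that line out. The sample response below is hand-written for illustration, not real model output.

```python
def add_reasoning_request(question):
    """Append a chain-of-thought instruction and an answer-format request."""
    return (
        f"{question}\n\n"
        "Think step by step, then give the final result on its own line "
        "starting with 'Answer:'."
    )


def extract_answer(response):
    """Return the text after the last 'Answer:' line, or None if there is none."""
    for line in reversed(response.splitlines()):
        stripped = line.strip()
        if stripped.startswith("Answer:"):
            return stripped[len("Answer:"):].strip()
    return None


print(add_reasoning_request("A train travels 120 km in 2 hours. What is its speed?"))

# A hand-written stand-in for what a model might return.
sample_response = (
    "Speed is distance divided by time.\n"
    "120 km divided by 2 hours is 60 km per hour.\n"
    "Answer: 60 km/h"
)
print(extract_answer(sample_response))
```

Keeping the reasoning visible also makes the output easier to check, since a wrong intermediate step is often easier to spot than a wrong final number.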

Iterate

Treat the first output as a draft, not a final product. Refine the prompt based on what was missing or off-target. Most professional users of LLMs iterate through several prompt versions before reaching their desired output.

Hallucinations: When Generative AI Makes Things Up

One of the most important limitations of current LLMs is their tendency to hallucinate — to confidently generate text that is factually wrong. The model might cite a scientific paper that does not exist, give a wrong date for a historical event, or invent a plausible-sounding but fictitious statistic.

Why does this happen? Remember: the model is a next-token predictor. It is optimized to produce fluent, coherent text — text that follows the statistical patterns of human-written prose. Fluent text often includes specific claims, citations, and facts. The model has learned to produce text that looks like factual writing, even in cases where the underlying claim is wrong or invented.

This is not dishonesty or intentional deception. The model has no built-in mechanism for checking a claim against reality; it simply produces a statistically likely continuation of the prompt. When that continuation happens to be factually wrong, nothing in the generation process catches the error.
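A toy example shows how this plays out. Suppose a model has assigned the probabilities below to the next word after "The capital of Australia is". The numbers are invented for illustration and do not come from any real model. Both ways of choosing a token consult only the probabilities; neither checks which city is actually the capital.

```python
import random

# Invented probabilities for illustration only; not taken from any real model.
next_token_probs = {"Sydney": 0.55, "Canberra": 0.40, "Melbourne": 0.05}


def greedy_choice(probs):
    """Pick the single most probable token."""
    return max(probs, key=probs.get)


def sample_choice(probs, rng):
    """Pick a token at random, weighted by probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


print(greedy_choice(next_token_probs))  # the most likely token, right or wrong

rng = random.Random(0)
print([sample_choice(next_token_probs, rng) for _ in range(5)])
```

In this made-up distribution the most probable token is a fluent but wrong answer (the capital is Canberra), and the selection step has no way to notice.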

Mitigations exist: grounding models in retrieved documents (Retrieval-Augmented Generation), training on better-quality factual data, and having models express uncertainty explicitly. But hallucination remains an active area of research and a practical concern for every LLM deployment.

Fine-Tuning: Customizing a Pre-Trained Model

Training a large language model from scratch costs millions to hundreds of millions of dollars. Most organizations cannot and do not need to do this. Instead, they use a technique called fine-tuning: starting with an existing pre-trained model and continuing training on a smaller, domain-specific dataset to adapt the model to a particular task or style.

Fine-tuning is how a general-purpose model becomes a specialized assistant. A legal technology company might fine-tune an LLM on a corpus of legal documents so it better understands contract language and jurisdiction-specific terminology. A medical company might fine-tune on clinical literature. A customer service platform might fine-tune on its brand's tone of voice and its product documentation.
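The idea of continuing training on domain text can be illustrated with a toy count-based model. This is only an analogy: real fine-tuning adjusts neural network weights with gradient descent, while the sketch below just keeps adding to word-pair counts. The two tiny corpora and the function names are made up.

```python
from collections import Counter, defaultdict


def update_counts(counts, tokens):
    """Add word-pair counts from `tokens` to an existing model."""
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def most_likely_next(counts, prev):
    """Return the most frequent follower of `prev`, or None if unseen."""
    followers = counts.get(prev)
    return followers.most_common(1)[0][0] if followers else None


# Made-up "general" and "legal" text.
general_text = "the party was fun . the party ended late . the party was loud .".split()
legal_text = (
    "the party shall pay . the party shall notify . "
    "the party shall comply . the party shall pay ."
).split()

model = update_counts(defaultdict(Counter), general_text)  # stands in for pre-training
before = most_likely_next(model, "party")

update_counts(model, legal_text)  # stands in for fine-tuning on domain data
after = most_likely_next(model, "party")

print(before, "->", after)  # was -> shall
```

The model is not rebuilt from scratch. It keeps what it learned from the general text, and the extra domain data shifts its predictions toward contract-style language.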

An even lighter-weight approach is prompt engineering without any fine-tuning: crafting prompts that effectively steer the model's behavior using only text instructions and examples, without changing any model weights. This approach is accessible to anyone with API access and produces surprisingly good results for many tasks.

Retrieval-Augmented Generation: Grounding AI in Facts

One of the most practically important techniques for deploying LLMs reliably is Retrieval-Augmented Generation (RAG). The idea is to give the model access to a searchable knowledge base at inference time, so it can retrieve relevant documents before generating its response.

The process works like this:

The user asks a question.
A retrieval system searches a database (company documents, a knowledge base, recent news) for the most relevant content.
That retrieved content is included in the prompt alongside the original question.
The LLM generates a response that is grounded in the retrieved content, and can cite its sources.

RAG can substantially reduce hallucinations for factual questions because the model no longer has to rely solely on what it memorized during training. It can reference current, accurate information that is explicitly present in its context window. This is how many enterprise AI assistants and some consumer tools work in practice. It is also how AI systems can answer questions about events that occurred after their training cutoff date.
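The first three steps of that process can be sketched without any model at all. The version below ranks documents by simple word overlap, which is a stand-in: production systems typically use more capable search, such as embedding-based retrieval. The documents, function names, and prompt wording are all assumptions for illustration, and the final call to an LLM is left out.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=2):
    """Rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_rag_prompt(question, documents, top_k=2):
    """Place the retrieved passages in the prompt alongside the question."""
    passages = retrieve(question, documents, top_k)
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# A made-up knowledge base.
documents = [
    "Refund policy: a refund is available within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
question = "What is the refund policy after purchase?"
print(build_rag_prompt(question, documents, top_k=1))
```

The prompt that reaches the model now contains the relevant passage, so the answer can be drawn from text in the context window rather than from memory alone.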

The Landscape of Generative AI in 2026

The generative AI space has expanded rapidly beyond text and images:

Modality	What It Generates	Notable Examples
Text	Essays, code, conversations, summaries	GPT-4, Claude 3, Gemini Ultra
Images	Photographs, illustrations, concept art	DALL-E 3, Midjourney, Stable Diffusion
Audio	Music, voice clones, sound effects	Suno, ElevenLabs, MusicGen
Video	Short clips, movie scenes	Sora, Runway Gen-3, Kling
Code	Functions, scripts, entire programs	GitHub Copilot, Cursor, Devin
3D / Multimodal	3D models, combined text + image + audio	Point-E, Gemini 1.5 Flash

The Copyright and Attribution Question

Generative AI has ignited fierce legal and ethical debate about copyright. Training data for image generators includes billions of images scraped from the internet, including images created by professional artists, photographers, and illustrators who did not consent to having their work used as training material. Training data for language models includes books, articles, and code written by authors and developers who similarly had no say in the matter.

Several legal cases are working through courts as of 2026, addressing questions that existing copyright law was not designed to answer:

Does training an AI model on copyrighted material constitute copyright infringement?
Do AI-generated outputs that stylistically resemble a specific human artist's work infringe that artist's rights?
Can AI-generated content itself be copyrighted, and if so, by whom — the user, the AI company, or neither?
Do artists have the right to opt out of having their work included in training datasets?

The outcomes of these cases will significantly shape the legal landscape for generative AI. Some AI companies have proactively introduced opt-out mechanisms for artists and have begun developing models trained on licensed content. The tension between the enormous value of large training datasets and the legitimate interests of content creators is one of the defining legal challenges of the current AI era.

Evaluating Generative AI Output: What to Check For

As generative AI becomes a common tool in professional and creative workflows, developing the habit of critically evaluating AI output is essential. A practical checklist:

Factual accuracy: verify any specific claims, statistics, citations, or proper nouns against authoritative sources. Hallucinations are most common for specific facts rather than general concepts.
Logical consistency: does the argument or explanation hold together? LLMs can produce text that sounds coherent but contains internal contradictions upon careful reading.
Completeness: has the AI addressed all aspects of the question, or has it produced a plausible-sounding but incomplete answer that covers only the most common cases?
Bias and perspective: whose viewpoint does the output reflect? LLMs trained predominantly on English-language Western text may default to assumptions and framings that are not universal.
Appropriateness for context: are the tone, format, and level of technicality right for the intended audience and use case?
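The factual-accuracy item in the checklist above can be partly supported with tooling. The sketch below is a rough heuristic, not a fact checker: it flags sentences that contain digits or an attribution phrase so that a human knows where to look first. The pattern, the function name, and the sample text are made up for illustration.

```python
import re

# Digits (dates, statistics) and attribution phrases often mark checkable claims.
CHECKABLE = re.compile(r"\d|according to", re.IGNORECASE)


def flag_sentences_to_verify(text):
    """Return the sentences that contain a number or an attribution phrase."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sentence for sentence in sentences if CHECKABLE.search(sentence)]


# Hand-written sample text standing in for model output.
draft = (
    "Paris is lovely in spring. "
    "The study reported a 47% increase in 2019. "
    "According to the author, results vary."
)
for sentence in flag_sentences_to_verify(draft):
    print("VERIFY:", sentence)
```

A helper like this only narrows down where to look. Confirming each flagged claim against an authoritative source is still a human job.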

Developing fluency with generative AI involves learning both how to elicit useful outputs and how to critically evaluate what you receive. The most effective users of these tools treat the AI as a capable but fallible collaborator whose output requires oversight, not as an infallible oracle.

Key Takeaways

Generative AI creates new content rather than classifying existing content.
LLMs are trained on next-token prediction at scale; the ability to generate fluent, knowledgeable text emerges from this simple objective.
RLHF shapes raw LLMs into helpful, safe assistants by incorporating human preference signals.
Image generators use diffusion models: they learn to reverse a noise-addition process, guided by text prompts.
Effective prompting — specificity, formatting instructions, examples, chain-of-thought — dramatically improves outputs.
Hallucinations occur because LLMs optimize for fluency, not factual accuracy.

We have now covered how AI works across text, images, and generative applications. Part 9 steps back from the technical to ask the harder questions: what are the ethical risks of these systems, and what could go wrong at societal scale?