How Generative AI Creates Text, Images, and Code from Prompts

ChatGPT Reached 100 Million Users in Two Months — Faster Than Any Application in History

When OpenAI launched ChatGPT in November 2022, it acquired one million users in five days and an estimated 100 million monthly active users within two months. According to UBS analyst estimates, that made it the fastest-growing consumer application up to that point: Instagram took about 2.5 years to reach the same mark and TikTok about nine months. The underlying technology, large language models based on the transformer architecture, had been developing for years before this public moment, but ChatGPT's conversational interface made generative AI accessible to non-technical users at scale for the first time. Generative AI is not a single technology but a family of approaches that share a common capability: producing novel, contextually appropriate content (text, images, audio, video, or code) in response to a prompt. Each modality rests on distinct but related technical foundations.

Text Generation: How Language Models Work

Large language models (LLMs) are autoregressive neural networks trained to predict the next token (word fragment) in a sequence given all preceding tokens. This simple objective — predict what comes next — applied at scale to massive text corpora produces models with surprising generalization: the ability to answer questions, summarize documents, write code, and follow complex instructions without these capabilities being explicitly programmed.
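The autoregressive loop described above can be sketched with a toy model. This is a minimal illustration, not how a production LLM works: the "model" here is a bigram count table that looks only at the previous token, whereas a real LLM conditions on all preceding tokens with a neural network. The names `train_bigram` and `generate` and the tiny corpus are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_new_tokens, rng):
    """Autoregressive loop: sample a next token, append it, repeat."""
    sequence = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(sequence[-1])
        if not followers:  # no observed continuation, so stop early
            break
        choices, weights = zip(*followers.items())
        # Sample in proportion to how often each continuation was seen.
        sequence.append(rng.choices(choices, weights=weights, k=1)[0])
    return sequence

# Hypothetical toy corpus, split on whitespace instead of real tokenization.
corpus = "the cat sat on the mat and the cat ate".split()
model = train_bigram(corpus)
sample = generate(model, "the", 5, random.Random(0))
print(" ".join(sample))
```

Each generated token is fed back in as context for the next step, which is the same loop an LLM runs at inference time, only with a far richer predictor.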

The transformer architecture, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need," is the foundation of modern LLMs. Its core mechanism is self-attention: a mathematical operation that allows each token in a sequence to attend to every other token and weigh their relevance to its own representation. This allows the model to capture long-range dependencies — understanding that a pronoun on line 50 refers to a noun on line 5 — that prior architectures like RNNs struggled with at scale.
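As a rough sketch of the self-attention operation, the following pure-Python code computes scaled dot-product attention for three toy token vectors. It is deliberately simplified: a real transformer first applies learned projections to obtain separate queries, keys, and values, uses multiple heads, and (in LLMs) masks out future tokens. The function names and input vectors are assumptions for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every token to this one: dot product scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # New representation: relevance-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors, used directly as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(x, x, x)
for row in attn:
    print([round(w, 3) for w in row])
```

Each row of `attn` sums to 1 and shows how strongly one token attends to every token in the sequence, regardless of how far apart they are.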

The Training Pipeline for LLMs

Pre-training: The model processes trillions of tokens from books, web pages, code repositories, and other text. It learns language statistics and general world knowledge from predicting the next token with a cross-entropy loss.
Supervised fine-tuning (SFT): Human annotators demonstrate desired input-output behaviors; the model is fine-tuned on these examples to follow instructions and respond conversationally.
Reinforcement Learning from Human Feedback (RLHF): Human raters rank model outputs; a reward model is trained on these rankings; the LLM is further optimized using reinforcement learning (PPO algorithm) to produce outputs humans prefer.
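The loss functions behind the first and third stages can be sketched in a few lines. The cross-entropy function is the standard next-token objective named above. The pairwise ranking loss is an assumption: it is a Bradley-Terry style formulation commonly used for reward models, and the source does not say which loss a given system uses. The PPO optimization step itself is omitted.

```python
import math

def next_token_cross_entropy(probs, target_index):
    """Loss for one position: -log of the probability given to the true next token."""
    return -math.log(probs[target_index])

def pairwise_reward_loss(reward_preferred, reward_rejected):
    """Ranking loss: -log(sigmoid(r_preferred - r_rejected)).

    Small when the reward model scores the human-preferred output higher.
    """
    diff = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-diff))

# Hypothetical predicted distributions over a 3-token vocabulary; token 0 is correct.
confident = next_token_cross_entropy([0.7, 0.2, 0.1], 0)
unsure = next_token_cross_entropy([0.1, 0.2, 0.7], 0)

# Hypothetical reward scores for a preferred and a rejected response.
agree = pairwise_reward_loss(2.0, 0.0)      # reward model agrees with the raters
disagree = pairwise_reward_loss(0.0, 2.0)   # reward model disagrees

print(confident < unsure, agree < disagree)
```

In both cases training lowers the loss, which pushes probability toward the observed next token in pre-training and SFT, and pushes reward scores toward the raters' ordering when fitting the reward model.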

Image Generation: Diffusion Models and GANs

Image generation systems work differently from text models. The dominant approach since 2021 is the diffusion model, which displaced the generative adversarial networks (GANs) that previously led the field. Diffusion models learn to reverse a noise-adding process: during training, noise is progressively added to an image over many small steps (often around a thousand), and the model learns to predict the noise added at each step so that it can be subtracted. At inference time, the model starts from pure random noise and iteratively denoises, guided by a text prompt, to produce a coherent image.
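The noise-adding half of this process has a simple closed form, sketched below under stated assumptions: a linear noise schedule from 1e-4 to 0.02 over 1,000 steps (values commonly used in the diffusion literature, not taken from this article) and a four-number list standing in for an image. The neural network that predicts the noise is not included; `recover_x0` only shows what a perfect noise prediction would allow.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Jump directly to a noised sample: sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps  # eps is what the denoising network is trained to predict

def recover_x0(xt, eps, alpha_bar):
    """With a perfect noise prediction, the clean signal can be solved for."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(xt, eps)]

rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)
image = [0.5, -0.25, 1.0, 0.0]  # stand-in for pixel values
noisy, noise = add_noise(image, alpha_bars[499], rng)
restored = recover_x0(noisy, noise, alpha_bars[499])
print(f"signal kept at final step: {alpha_bars[-1]:.6f}")
```

By the final step almost none of the original signal remains, which is why generation can start from pure random noise.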

Latent diffusion models (the architecture behind Stable Diffusion and, reportedly, DALL-E 3) perform the diffusion process in a compressed latent space rather than pixel space, dramatically reducing computational cost. A text encoder (typically CLIP or a similar contrastive model) converts the text prompt into an embedding that guides the denoising process via cross-attention layers, so that the emerging image matches the semantic content of the prompt.
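A quick back-of-the-envelope calculation shows why working in latent space is cheaper. The shapes below are assumptions based on the commonly cited Stable Diffusion v1 configuration (a 512x512 RGB image compressed by a factor of 8 per side into 4 latent channels); other systems use different settings, and the element count is only a proxy for actual compute.

```python
def tensor_elements(height, width, channels):
    """Number of values the denoiser has to process at each step."""
    return height * width * channels

def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape after the autoencoder compresses the image (assumed factors)."""
    return height // downsample, width // downsample, latent_channels

pixel_count = tensor_elements(512, 512, 3)
latent_count = tensor_elements(*latent_shape(512, 512))
reduction = pixel_count / latent_count
print(pixel_count, latent_count, reduction)  # 786432 16384 48.0
```

Under these assumptions every denoising step operates on 48 times fewer values than it would in pixel space, with a separate decoder turning the final latent back into an image.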

System	Architecture	Training Scale	Key Capability
Stable Diffusion (open source)	Latent diffusion + CLIP text encoder	Subsets of LAION-5B (about 5.85 billion image-text pairs)	Customizable; fine-tunable; runs locally
DALL-E 3 (OpenAI)	Diffusion model trained on recaptioned images; GPT-4 prompt rewriting	Proprietary, large scale	High prompt coherence; text rendering
Midjourney	Proprietary (believed diffusion-based)	Proprietary	Distinctive aesthetic quality; community tuning
Imagen / Imagen 2 (Google)	Cascade diffusion + T5 text encoder	Proprietary	Strong text-image alignment; photorealism
Sora (OpenAI)	Video diffusion transformer	Proprietary video dataset	Coherent long-form video generation (2024)

Code Generation: Specialized LLMs for Programming

Code generation models are LLMs trained on large corpora of source code (GitHub repositories, documentation, coding forums) in addition to natural language text. Tools and models such as GitHub Copilot (originally built on OpenAI's Codex), Meta's Code Llama, and Google's Gemini Code Assist can complete functions, translate between programming languages, write unit tests, and explain existing code.

Code differs structurally from natural language: it must be syntactically valid, semantically consistent, and functionally correct, and these properties can be partly checked by executing the generated code. Training with execution feedback, which rewards models for producing code that passes tests, has significantly improved functional correctness. A 2023 technical report from Google DeepMind on AlphaCode 2 estimated that the system performed at about the 85th percentile of human competitors in Codeforces programming competitions.
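One way to picture execution feedback is a function that runs a candidate program against test cases and returns the fraction passed as a reward signal. This is a hedged sketch, not the method of any particular system: `pass_rate`, the `add` task, and the three candidates are hypothetical, and real pipelines run untrusted code inside a sandbox with time limits rather than calling `exec` directly.

```python
def pass_rate(source, function_name, test_cases):
    """Run candidate code and return the fraction of test cases it passes.

    WARNING: exec runs arbitrary code. Real systems isolate this step.
    """
    namespace = {}
    try:
        exec(source, namespace)
        candidate = namespace[function_name]
    except Exception:
        return 0.0  # syntax error or missing function: no reward
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a runtime error counts as a failed test
    return passed / len(test_cases)

# Hypothetical task: implement add(a, b).
tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"     # runs, but wrong on most inputs
broken = "def add(a, b) return a + b\n"          # not valid Python

rewards = {name: pass_rate(src, "add", tests)
           for name, src in [("good", good), ("buggy", buggy), ("broken", broken)]}
print(rewards)
```

The three candidates separate cleanly into syntactically invalid, valid but functionally wrong, and correct, which is the kind of graded signal that execution-based training can exploit.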

Emergent Capabilities and Scaling Laws

One of the most striking properties reported for LLMs is emergence: qualitatively new capabilities that appear abruptly as model scale crosses certain thresholds, rather than improving gradually. Chain-of-thought reasoning, multi-step arithmetic, and strong few-shot learning from examples in the prompt have been reported to emerge at roughly the 100-billion-parameter scale, with much smaller models performing near chance on the same benchmarks. The exact thresholds vary by task and model family, and how abrupt the transitions appear depends partly on the evaluation metric. Kaplan et al.'s scaling laws (2020) showed that language-modeling loss falls predictably, as a power law, with model size, dataset size, and compute, a finding that justified the enormous investment in training progressively larger models.
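The power-law form of these scaling laws can be illustrated numerically. The sketch below uses the parameter-count law L(N) = (N_c / N)^alpha_N with constants close to those reported by Kaplan et al. (alpha_N of about 0.076 and N_c of about 8.8e13 non-embedding parameters). Treat the constants as illustrative assumptions, since they were fitted to one specific training setup.

```python
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Predicted test loss as a power law in parameter count (data not limiting)."""
    return (n_c / n_params) ** alpha_n

sizes = (1e8, 1e9, 1e10, 1e11)
losses = [loss_from_params(n) for n in sizes]
for n, loss in zip(sizes, losses):
    print(f"{n:.0e} params -> predicted loss {loss:.3f}")

# Every 10x increase in parameters multiplies the loss by the same factor.
ratio = loss_from_params(1e10) / loss_from_params(1e9)
print(f"loss ratio per 10x parameters: {ratio:.4f}")
```

The constant ratio per tenfold increase is what "predictable" means here: the loss curve is a straight line on a log-log plot, which lets the payoff of a larger model be estimated before it is trained.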

GPT-4 (2023) has not had its parameter count officially disclosed; estimates range from 220 billion to 1.8 trillion in a mixture-of-experts architecture.
Chinchilla scaling laws (Hoffmann et al. 2022) revised the Kaplan scaling laws, finding that for a given compute budget, the large models of the time were undertrained, that is, too large for the amount of data they were trained on, and recommending smaller models trained on more data.
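Constitutional AI (Anthropic) and Direct Preference Optimization (DPO) are alignment techniques that simplify or reduce the cost of RLHF: Constitutional AI replaces much of the human preference labeling with AI feedback guided by a written set of principles, while DPO optimizes the model directly on preference pairs without training a separate reward model or running reinforcement learning.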
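The Chinchilla recommendation in the list above can be turned into a rough calculator. The sketch relies on two widely used approximations that are assumptions here, not statements from this article: training compute of about 6 FLOPs per parameter per token (C ≈ 6·N·D), and a compute-optimal ratio of roughly 20 training tokens per parameter.

```python
import math

def compute_optimal_allocation(flops, tokens_per_param=20.0):
    """Split a FLOP budget using C ~= 6 * N * D and D ~= tokens_per_param * N."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget matching Chinchilla's published size: 70B parameters, 1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal_allocation(budget)
print(f"params: {n_params:.3g}, tokens: {n_tokens:.3g}")
```

Under this rule of thumb, parameters and training tokens both grow with the square root of the compute budget, so a tenfold larger budget calls for roughly 3.2 times more of each rather than a model ten times larger.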

What Generative AI Cannot Do

Current generative AI systems do not "understand" content in the way humans do. They have no persistent memory across conversations (without explicit architectural additions), no grounded real-world experience, no reliable causal models of physics or human motivation, and no guaranteed factual accuracy. The resulting failure mode is often called hallucination: LLMs generate statistically plausible continuations of text, so when a factual question is poorly covered by the training data, the model can still produce fluent, confident-sounding text that fabricates citations, dates, and facts. These limitations are active research areas, addressed through retrieval-augmented generation (RAG), tool use, and improved training objectives, but they remain unresolved at a fundamental level as of 2025.
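Retrieval-augmented generation can be sketched as two steps: find the stored passage most relevant to the question, then place it in the prompt so the model answers from supplied text rather than from memory alone. The example below is a toy under stated assumptions. It ranks passages by bag-of-words cosine similarity, where real systems typically use learned embeddings and a vector index, and it stops at building the prompt rather than calling a model. All function names and documents are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Very rough text representation: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents):
    """Prepend retrieved context so the answer can be grounded in it."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "The transformer architecture was introduced in 2017.",
    "Diffusion models generate images by iteratively removing noise.",
    "Chinchilla was trained on 1.4 trillion tokens.",
]
question = "When was the transformer architecture introduced?"
prompt = build_prompt(question, docs)
print(prompt)
```

Grounding the prompt this way narrows the room for fabrication but does not eliminate it, since the model can still misread or ignore the supplied context.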

