Optimization Algorithms: From Gradient Descent to Evolutionary Methods

What Optimization Means Mathematically

Optimization is the process of finding the input to a function that produces the minimum or maximum output value — the best solution according to some defined criterion. In mathematical terms, we seek x* such that f(x*) ≤ f(x) for all x in the feasible region (for minimization), where f is the objective function and the feasible region is defined by constraints on the allowed values of x. This abstract formulation encompasses an enormous range of practical problems: minimizing a factory's production cost, maximizing the range of an aircraft, finding the shortest delivery route, training a neural network to recognize images, or designing a drug molecule to bind a target protein.

Optimization problems can be classified along several dimensions. If f and the constraints are linear in x, the problem is a linear program, solvable efficiently by the simplex method or interior point methods. If f is nonlinear, it becomes a nonlinear program, which may have multiple local optima — points that are optimal relative to nearby alternatives but not globally. If some or all variables must be integers, the problem is an integer or mixed-integer program, dramatically harder. If the problem involves uncertainty in the objective or constraints, it becomes stochastic optimization. The choice of algorithm depends critically on the structure of the problem: no single method dominates across all types.

Gradient Descent and Its Variants

Gradient descent is the workhorse optimization algorithm for smooth, differentiable objective functions, and the engine behind training virtually all modern machine learning models. The key insight is that the gradient ∇f(x) — the vector of partial derivatives of f with respect to each variable — points in the direction of steepest increase of f. By moving in the opposite direction (the negative gradient), we descend toward a local minimum. The update rule is simple: x_{t+1} = x_t - α × ∇f(x_t), where α (the learning rate) controls step size.

In machine learning, computing the gradient over the full dataset (batch gradient descent) is often computationally prohibitive. Stochastic gradient descent (SGD) approximates the gradient using a single randomly selected training example per step, producing noisy but fast updates. Mini-batch gradient descent strikes a balance, using subsets of 32 to 512 examples. Modern deep learning relies almost exclusively on mini-batch SGD and its variants. Momentum methods accelerate convergence by accumulating a velocity vector, allowing the algorithm to build speed in consistent directions and dampen oscillations — analogous to a ball rolling down a hill. Nesterov accelerated gradient improves on standard momentum by evaluating the gradient at the anticipated position rather than the current one.

Adaptive gradient methods adjust the learning rate individually for each parameter based on historical gradient information. AdaGrad accumulates squared gradients, reducing the learning rate for frequently updated parameters and increasing it for rarely updated ones — beneficial for sparse data. RMSprop uses an exponentially weighted average of squared gradients to prevent AdaGrad's learning rate from shrinking to zero. Adam (Adaptive Moment Estimation), combining momentum and RMSprop ideas, is currently the most widely used optimizer in deep learning, maintaining both first-moment (mean) and second-moment (uncentered variance) estimates of gradients.

Convexity and Why It Matters

A function is convex if the line segment between any two points on its graph lies above or on the graph — geometrically, the function curves upward everywhere with no local valleys. Convex optimization problems have a crucial property: any local minimum is also a global minimum. This makes convex problems much more tractable than general nonlinear problems, and a rich theory guarantees efficient algorithms with provable convergence properties.

Linear programming is a special case of convex optimization. Quadratic programming (convex quadratic objective with linear constraints), Second Order Cone Programs, and Semidefinite Programs are increasingly general convex problem classes with efficient solvers. Support Vector Machines, LASSO regularization in statistics, and many signal processing problems reduce to convex optimization. Recognizing that a problem is convex — or reformulating it as one — often transforms an intractable problem into one solvable in seconds. The field of convex optimization, developed comprehensively in Stephen Boyd's influential textbook, has become a fundamental tool across engineering, statistics, economics, and machine learning.

Newton's Method and Second-Order Methods

Gradient descent uses only first-order information (the gradient). Newton's method incorporates second-order information — the Hessian matrix of second derivatives — to take more informed steps. By fitting a quadratic approximation to f near the current point and jumping to the minimum of this approximation, Newton's method converges quadratically near a solution (the error roughly squares each iteration), compared to gradient descent's linear convergence. The update is x_{t+1} = x_t - H^{-1} ∇f(x_t), where H is the Hessian.

Newton's method's drawback is computational cost: computing and inverting the Hessian requires O(n²) memory and O(n³) operations for n parameters — prohibitive for modern neural networks with billions of parameters. Quasi-Newton methods, particularly L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno), approximate the inverse Hessian using a history of gradient differences, achieving much of Newton's faster convergence at a fraction of the cost. L-BFGS is widely used in training logistic regression, conditional random fields, and smaller neural networks. Interior point methods, which use Newton steps to traverse the feasible region while maintaining barrier functions for constraints, are the gold standard for general convex programs and large-scale linear programs.

Metaheuristic and Evolutionary Methods

For black-box optimization problems — where the objective function has no exploitable mathematical structure, may not be differentiable, or has complex constraint structure — gradient-based methods fail and metaheuristic approaches become necessary. These algorithms are inspired by natural processes and make no assumptions about the mathematical form of the objective, requiring only the ability to evaluate f(x) at candidate solutions.

Genetic algorithms (GAs) model evolution by natural selection. A population of candidate solutions is evolved through selection (keeping the fittest individuals), crossover (combining parts of two parent solutions to create offspring), and mutation (randomly modifying solutions). Over generations, the population evolves toward better solutions. GAs excel at combinatorial problems like circuit layout, scheduling, and protein structure prediction. Simulated annealing mimics the physical process of cooling a material: at high "temperature," worse solutions are occasionally accepted (allowing escape from local optima); as temperature decreases, only improvements are accepted. Particle swarm optimization models a swarm of particles moving through the search space, with each particle influenced by its own best position and the swarm's global best — effective for continuous optimization. Ant colony optimization models foraging ants depositing pheromones on promising paths, naturally amplifying good solutions and has found success in vehicle routing and network design.

Bayesian Optimization and Modern Applications

Bayesian optimization addresses the problem of optimizing expensive black-box functions — where each evaluation requires significant time or computational resources, such as running a physical experiment, training a large neural network, or simulating a complex system. It maintains a probabilistic surrogate model (typically a Gaussian process) of the objective function and uses this model to intelligently select the next evaluation point, balancing exploration (querying uncertain regions) and exploitation (focusing near known optima).

The acquisition function — such as Expected Improvement or Upper Confidence Bound — quantifies the value of evaluating each candidate point given current knowledge. Bayesian optimization typically finds good solutions in tens to hundreds of evaluations, compared to random search or grid search requiring thousands. It has become standard for hyperparameter tuning in machine learning, design of pharmaceutical experiments, and engineering design optimization. Modern research combines Bayesian optimization with neural surrogate models (neural architecture search), multi-fidelity evaluations (using cheap approximations to guide expensive ones), and parallel experimental design. As optimization underpins virtually every quantitative decision-making problem in science, engineering, and economics, advances in optimization algorithms continue to have enormous practical impact across industry and research.

Optimization Algorithms: From Gradient Descent to Evolutionary Methods

What Optimization Means Mathematically

Gradient Descent and Its Variants

Convexity and Why It Matters

Newton's Method and Second-Order Methods

Metaheuristic and Evolutionary Methods

Bayesian Optimization and Modern Applications

Related Articles

Bayes' Theorem: How to Update Beliefs With New Evidence

Game Theory Explained: Nash Equilibria, Prisoner's Dilemma, and Strategic Decision-Making

How Bayesian Statistics Updates Beliefs With New Evidence

How Compound Interest Works: The Math Behind Exponential Growth