Softmax: variational, information-theoretic, and limiting characterizations #
The deep theory of softmax (defined in Core.Probability.LogitChoice): why it
takes the exponential form, what it optimizes, how it concentrates, and its link
to exponential tilting.
Main results #
cauchy_mul_exp,luce_fechnerian_exp— the Fechnerian bridge: the exponential form is forced by a Cauchy functional equation (ratio scale → interval scale).gibbs_variational— softmax uniquely maximizesH(p) + α·⟨p, s⟩on the simplex.softmax_maximizes_entropyReg,softmax_unique_maximizer,softmax_minimizes_freeEnergy,max_entropy_duality,entropy_le_log_card— the maximum-entropy / minimum-free-energy characterization.softmax_argmax_limit,softmax_nonmax_limit— concentration on the argmax as rationality → ∞.bayesian_maximizes— the Bayesian posterior maximizes expected log-likelihood.partitionFn_eq_mgf,logSumExp_eq_cgf,tilted_count_eq_withDensity_softmax— softmax as the counting-measure case of mathlib's exponential tilting.
[UPSTREAM]: pure-math characterizations of softmax; no linguistics. Quality-and-role
marker — sits above LogitChoice, below the rational-agent framing.
Why softmax? The Fechnerian characterization #
The exponential parameterization score = exp(α · utility) is not a design
choice — it is the unique transformation connecting Luce's ratio scale to
a utility (interval) scale ([AM58]).
Ratio vs interval scales. Luce's Axiom 1 (IIA) yields a ratio scale
v: only ratios v(a)/v(b) are meaningful. Fechner's
psychophysics requires an interval scale u: only differences
u(a) - u(b) are meaningful. The question: how are v and u related?
Derivation. From P(a,b) = v(a)/(v(a)+v(b)), the odds ratio is
v(a)/v(b) = g(u(a) - u(b)) for some function g. Transitivity of ratios
(v(a)/v(c) = [v(a)/v(b)] · [v(b)/v(c)]) forces Cauchy's multiplicative
functional equation: g(s + t) = g(s) · g(t). The unique monotone increasing
solution is g(s) = exp(k · s) (cauchy_mul_exp), giving:
v(a) = C · exp(k · u(a))— the ratio scale IS the exponential of utilityP(a,b) = 1/(1 + exp(-k · (u(a) - u(b))))— the logistic function- For n alternatives:
P(a|S) = exp(k · u(a)) / Σ exp(k · u(b))— softmax
The parameter k > 0 is the rationality parameter α in RSA.
Cauchy's multiplicative functional equation (classical): if g : ℝ → ℝ
satisfies g(s + t) = g(s) · g(t) and is strictly monotone increasing, then
g(s) = exp(k · s) for some k > 0.
Fechnerian uniqueness ([AM58]): if a ratio scale v
and interval scale u represent the same ordering via
v(x)/v(y) = g(u(x) - u(y)) for a strictly monotone multiplicative g, then
v is the exponential of u.
Gibbs variational principle #
The softmax distribution uniquely maximizes entropy plus expected score on the probability simplex — the mathematical foundation for RSA convergence ([ZHL21]).
The KL machinery used here — klFinite, kl_eq_sum_klFun, kl_nonneg — is
inlined as private ι → ℝ helpers, since mathlib provides only the PMF form
(PMF.klDiv / PMF.toReal_klDiv_eq_sum_log_div).
The per-meaning speaker objective: f(s) = Σᵤ [negMulLog(sᵤ) + α · sᵤ · vᵤ].
This is the function that the speaker maximizes for each meaning m, where vᵤ = log L(m|u) is the listener utility of utterance u.
Equations
- Core.speakerObj v α s = ∑ u : ι, ((s u).negMulLog + α * s u * v u)
Instances For
The softmax achieves f(s*) = log Z, where Z is the partition function.
Gibbs Variational Principle: softmax maximizes entropy + expected score.
For any distribution p on ι and scores s : ι → ℝ: H(p) + α·⟨p, s⟩ ≤ H(q) + α·⟨q, s⟩ where q = softmax(s, α) and H(p) = Σ negMulLog(pᵢ).
Softmax concentration at high rationality #
As the rationality parameter α → ∞, softmax concentrates all probability mass on the action with highest utility — i.e., softmax converges to argmax. This connects:
- MaxEnt phonology ↔ OT: a MaxEnt grammar with weights
wbecomes categorical (OT-like) as temperature → 0 (equivalently α → ∞). - RSA ↔ neo-Gricean pragmatics: a soft-rational RSA speaker becomes a hard-rational Gricean reasoner in the α → ∞ limit.
Softmax → argmax limit: as α → ∞, softmax concentrates on the unique
maximizer. For any i_max with s i_max strictly greater than all other
scores, softmax(s, α)_{i_max} → 1.
This is the formal connection between MaxEnt (soft optimization) and OT (hard optimization): OT is the α → ∞ limit of MaxEnt.
Complement of the limit: non-maximal actions get probability → 0.
Shannon entropy and maximum entropy #
This section uses the private ι → ℝ Shannon entropy (entropy, above), since
softmax is a real-arithmetic construction; consumers wanting the PMF-typed form
can convert via Core.Probability.Entropy.
Maximum entropy is achieved by the uniform distribution.
Entropy of softmax: H(softmax(s, α)) = log Z - α · 𝔼[s].
Entropy-regularized objective: G_α(p, s) = ⟨s, p⟩ + (1/α) H(p).
Equations
- Core.entropyRegObjective s α p = ∑ i : ι, p i * s i + 1 / α * Core.entropy✝ Finset.univ p
Instances For
The maximum value of the entropy-regularized objective.
Softmax maximizes the entropy-regularized objective.
Softmax is the unique maximizer of the entropy-regularized objective.
Free energy (from statistical mechanics).
Equations
- Core.freeEnergy s α p = -∑ i : ι, p i * s i - 1 / α * Core.entropy✝ Finset.univ p
Instances For
Softmax is the Boltzmann distribution: minimizes free energy.
The log-partition function is convex in α.
Derivative of log-partition gives expected value:
d/dα log(Σ exp(α sᵢ)) = Σ softmax(s,α)ᵢ · sᵢ.
Per-element offsets #
These instantiate the plain logSumExp/softmax at the shifted argument
α • s + r, i.e. logSumExp (α • s + r) = log Σ exp(α·sᵢ + rᵢ). This form
appears when differentiating the log-partition with respect to a single weight
wⱼ in a multi-constraint grammar, where sᵢ is constraint j's contribution
to candidate i and rᵢ the contribution of all other constraints (constant in
wⱼ).
Derivative of offset log-partition gives weighted expected value:
d/dα log(Σ exp(α·sᵢ + rᵢ)) = Σ softmax(α•s + r)ᵢ · sᵢ.
The offset log-partition function is convex in α.
The log conditional likelihood α ↦ (α · sᵧ + rᵧ) − logSumExp(α•s + r)
is concave (affine minus convex).
The derivative of log conditional likelihood α ↦ (α·sᵧ + rᵧ) − logSumExp(α•s + r)
is the observed feature value minus the expected value:
d/dα = sᵧ − Σᵢ softmax(α•s + r)ᵢ · sᵢ.
Strong duality: max entropy = min free energy.
Bayesian optimality #
Bayesian optimality: the normalized posterior maximizes weighted
log-likelihood on the probability simplex. For weights wᵢ ≥ 0 with
C = Σ wᵢ > 0 and any distribution L on the simplex, the normalized posterior
p*ᵢ = wᵢ / C satisfies Σᵢ wᵢ · log Lᵢ ≤ Σᵢ wᵢ · log p*ᵢ.
Connection to exponential tilting and the cumulant generating function #
Softmax is the finite/counting-measure case of mathlib's exponential-family machinery: the partition function is the moment generating function, log-sum-exp is the cumulant generating function, and softmax is the density of the exponentially-tilted (Esscher / Boltzmann–Gibbs) counting measure.
The partition function ∑ exp(α·sᵢ) is the moment generating function of the
scores under the counting measure.
Log-sum-exp is the cumulant generating function of the scores under the counting measure.
Softmax is the density of the exponentially-tilted counting measure: tilting
the counting measure by the scores α·s yields the measure whose density is
softmax (α•s).