Documentation

Linglib.Core.Probability.SoftmaxTheory

Softmax: variational, information-theoretic, and limiting characterizations #

The deep theory of softmax (defined in Core.Probability.LogitChoice): why it takes the exponential form, what it optimizes, how it concentrates, and its link to exponential tilting.

Main results #

cauchy_mul_exp, luce_fechnerian_exp — the Fechnerian bridge: the exponential form is forced by a Cauchy functional equation (ratio scale → interval scale).
gibbs_variational — softmax uniquely maximizes H(p) + α·⟨p, s⟩ on the simplex.
softmax_maximizes_entropyReg, softmax_unique_maximizer, softmax_minimizes_freeEnergy, max_entropy_duality, entropy_le_log_card — the maximum-entropy / minimum-free-energy characterization.
softmax_argmax_limit, softmax_nonmax_limit — concentration on the argmax as rationality → ∞.
bayesian_maximizes — the Bayesian posterior maximizes expected log-likelihood.
partitionFn_eq_mgf, logSumExp_eq_cgf, tilted_count_eq_withDensity_softmax — softmax as the counting-measure case of mathlib's exponential tilting.

[UPSTREAM]: pure-math characterizations of softmax; no linguistics. Quality-and-role marker — sits above LogitChoice, below the rational-agent framing.

Why softmax? The Fechnerian characterization #

The exponential parameterization score = exp(α · utility) is not a design choice — it is the unique transformation connecting Luce's ratio scale to a utility (interval) scale ([AM58]).

Ratio vs interval scales. Luce's Axiom 1 (IIA) yields a ratio scale v: only ratios v(a)/v(b) are meaningful. Fechner's psychophysics requires an interval scale u: only differences u(a) - u(b) are meaningful. The question: how are v and u related?

Derivation. From P(a,b) = v(a)/(v(a)+v(b)), the odds ratio is v(a)/v(b) = g(u(a) - u(b)) for some function g. Transitivity of ratios (v(a)/v(c) = [v(a)/v(b)] · [v(b)/v(c)]) forces Cauchy's multiplicative functional equation: g(s + t) = g(s) · g(t). The unique monotone increasing solution is g(s) = exp(k · s) (cauchy_mul_exp), giving:

v(a) = C · exp(k · u(a)) — the ratio scale IS the exponential of utility
P(a,b) = 1/(1 + exp(-k · (u(a) - u(b)))) — the logistic function
For n alternatives: P(a|S) = exp(k · u(a)) / Σ exp(k · u(b)) — softmax

The parameter k > 0 is the rationality parameter α in RSA.

theorem Core.cauchy_mul_exp (g : ℝ → ℝ) (hg_mul : ∀ (s t : ℝ), g (s + t) = g s * g t) (hg_mono : StrictMono g) :

∃ (k : ℝ), 0 < k ∧ g 0 = 1 ∧ ∀ (s : ℝ), g s = Real.exp (k * s)

Cauchy's multiplicative functional equation (classical): if g : ℝ → ℝ satisfies g(s + t) = g(s) · g(t) and is strictly monotone increasing, then g(s) = exp(k · s) for some k > 0.

theorem Core.luce_fechnerian_exp {X : Type u_1} (v u : X → ℝ) (g : ℝ → ℝ) (hv_pos : ∀ (x : X), 0 < v x) (h_ratio : ∀ (x y : X), v x / v y = g (u x - u y)) (hg_mul : ∀ (s t : ℝ), g (s + t) = g s * g t) (hg_mono : StrictMono g) :

∃ (k : ℝ), 0 < k ∧ ∀ (x₀ x : X), v x = v x₀ * Real.exp (k * (u x - u x₀))

Fechnerian uniqueness ([AM58]): if a ratio scale v and interval scale u represent the same ordering via v(x)/v(y) = g(u(x) - u(y)) for a strictly monotone multiplicative g, then v is the exponential of u.

Gibbs variational principle #

The softmax distribution uniquely maximizes entropy plus expected score on the probability simplex — the mathematical foundation for RSA convergence ([ZHL21]).

The KL machinery used here — klFinite, kl_eq_sum_klFun, kl_nonneg — is inlined as private ι → ℝ helpers, since mathlib provides only the PMF form (PMF.klDiv / PMF.toReal_klDiv_eq_sum_log_div).

noncomputable def Core.speakerObj {ι : Type u_1} [Fintype ι] (v : ι → ℝ) (α : ℝ) (s : ι → ℝ) :

ℝ

The per-meaning speaker objective: f(s) = Σᵤ [negMulLog(sᵤ) + α · sᵤ · vᵤ].

This is the function that the speaker maximizes for each meaning m, where vᵤ = log L(m|u) is the listener utility of utterance u.

Equations

Core.speakerObj v α s = ∑ u : ι, ((s u).negMulLog + α * s u * v u)

Instances For

theorem Core.speakerObj_at_softmax {ι : Type u_1} [Fintype ι] [Nonempty ι] (v : ι → ℝ) (α : ℝ) :

speakerObj v α (softmax (α • v)) = logSumExp (α • v)

The softmax achieves f(s*) = log Z, where Z is the partition function.

theorem Core.gibbs_variational {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι → ℝ) (α : ℝ) (p : ι → ℝ) (hp_nonneg : ∀ (i : ι), 0 ≤ p i) (hp_sum : ∑ i : ι, p i = 1) :

∑ i : ι, (p i).negMulLog + α * ∑ i : ι, p i * s i ≤ ∑ i : ι, (softmax (α • s) i).negMulLog + α * ∑ i : ι, softmax (α • s) i * s i

Gibbs Variational Principle: softmax maximizes entropy + expected score.

For any distribution p on ι and scores s : ι → ℝ: H(p) + α·⟨p, s⟩ ≤ H(q) + α·⟨q, s⟩ where q = softmax(s, α) and H(p) = Σ negMulLog(pᵢ).

Softmax concentration at high rationality #

As the rationality parameter α → ∞, softmax concentrates all probability mass on the action with highest utility — i.e., softmax converges to argmax. This connects:

MaxEnt phonology ↔ OT: a MaxEnt grammar with weights w becomes categorical (OT-like) as temperature → 0 (equivalently α → ∞).
RSA ↔ neo-Gricean pragmatics: a soft-rational RSA speaker becomes a hard-rational Gricean reasoner in the α → ∞ limit.

theorem Core.softmax_argmax_limit {ι : Type u_1} [Fintype ι] [Nonempty ι] [DecidableEq ι] (s : ι → ℝ) (i_max : ι) (h_max : ∀ (j : ι), j ≠ i_max → s j < s i_max) :

Filter.Tendsto (fun (α : ℝ) => softmax (α • s) i_max) Filter.atTop (nhds 1)

Softmax → argmax limit: as α → ∞, softmax concentrates on the unique maximizer. For any i_max with s i_max strictly greater than all other scores, softmax(s, α)_{i_max} → 1.

This is the formal connection between MaxEnt (soft optimization) and OT (hard optimization): OT is the α → ∞ limit of MaxEnt.

theorem Core.softmax_nonmax_limit {ι : Type u_1} [Fintype ι] (s : ι → ℝ) (i_max : ι) (h_max : ∀ (j : ι), j ≠ i_max → s j < s i_max) (j : ι) (hj : j ≠ i_max) :

Filter.Tendsto (fun (α : ℝ) => softmax (α • s) j) Filter.atTop (nhds 0)

Complement of the limit: non-maximal actions get probability → 0.

Shannon entropy and maximum entropy #

This section uses the private ι → ℝ Shannon entropy (entropy, above), since softmax is a real-arithmetic construction; consumers wanting the PMF-typed form can convert via Core.Probability.Entropy.

theorem Core.entropy_le_log_card {ι : Type u_1} [Fintype ι] [Nonempty ι] (p : ι → ℝ) (hp_nonneg : ∀ (i : ι), 0 ≤ p i) (hp_sum : ∑ i : ι, p i = 1) :

Core.entropy✝ Finset.univ p ≤ Real.log ↑(Fintype.card ι)

Maximum entropy is achieved by the uniform distribution.

theorem Core.entropy_softmax {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι → ℝ) (α : ℝ) :

Core.entropy✝ Finset.univ (softmax (α • s)) = Real.log (partitionFn (α • s)) - α * ∑ i : ι, softmax (α • s) i * s i

Entropy of softmax: H(softmax(s, α)) = log Z - α · 𝔼[s].

noncomputable def Core.entropyRegObjective {ι : Type u_1} [Fintype ι] (s : ι → ℝ) (α : ℝ) (p : ι → ℝ) :

ℝ

Entropy-regularized objective: G_α(p, s) = ⟨s, p⟩ + (1/α) H(p).

Equations

Core.entropyRegObjective s α p = ∑ i : ι, p i * s i + 1 / α * Core.entropy✝ Finset.univ p

Instances For

theorem Core.entropyRegObjective_softmax {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι → ℝ) (α : ℝ) (hα : 0 < α) :

entropyRegObjective s α (softmax (α • s)) = 1 / α * Real.log (partitionFn (α • s))

The maximum value of the entropy-regularized objective.

theorem Core.softmax_maximizes_entropyReg {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι → ℝ) (α : ℝ) (hα : 0 < α) (p : ι → ℝ) (hp_nonneg : ∀ (i : ι), 0 ≤ p i) (hp_sum : ∑ i : ι, p i = 1) :

entropyRegObjective s α p ≤ entropyRegObjective s α (softmax (α • s))

Softmax maximizes the entropy-regularized objective.

theorem Core.softmax_unique_maximizer {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι → ℝ) (α : ℝ) (hα : 0 < α) (p : ι → ℝ) (hp_nonneg : ∀ (i : ι), 0 ≤ p i) (hp_sum : ∑ i : ι, p i = 1) (h_max : entropyRegObjective s α p = entropyRegObjective s α (softmax (α • s))) :

p = softmax (α • s)

Softmax is the unique maximizer of the entropy-regularized objective.

noncomputable def Core.freeEnergy {ι : Type u_1} [Fintype ι] (s : ι → ℝ) (α : ℝ) (p : ι → ℝ) :

ℝ

Free energy (from statistical mechanics).

Equations

Core.freeEnergy s α p = -∑ i : ι, p i * s i - 1 / α * Core.entropy✝ Finset.univ p

Instances For

theorem Core.softmax_minimizes_freeEnergy {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι → ℝ) (α : ℝ) (hα : 0 < α) (p : ι → ℝ) (hp_nonneg : ∀ (i : ι), 0 ≤ p i) (hp_sum : ∑ i : ι, p i = 1) :

freeEnergy s α (softmax (α • s)) ≤ freeEnergy s α p

Softmax is the Boltzmann distribution: minimizes free energy.

theorem Core.logSumExp_convex {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι → ℝ) :

ConvexOn ℝ Set.univ fun (α : ℝ) => logSumExp (α • s)

The log-partition function is convex in α.

theorem Core.deriv_logSumExp {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι → ℝ) (α : ℝ) :

deriv (fun (α : ℝ) => logSumExp (α • s)) α = ∑ i : ι, softmax (α • s) i * s i

Derivative of log-partition gives expected value: d/dα log(Σ exp(α sᵢ)) = Σ softmax(s,α)ᵢ · sᵢ.

Per-element offsets #

These instantiate the plain logSumExp/softmax at the shifted argument α • s + r, i.e. logSumExp (α • s + r) = log Σ exp(α·sᵢ + rᵢ). This form appears when differentiating the log-partition with respect to a single weight wⱼ in a multi-constraint grammar, where sᵢ is constraint j's contribution to candidate i and rᵢ the contribution of all other constraints (constant in wⱼ).

theorem Core.hasDerivAt_logSumExpOffset {ι : Type u_1} [Fintype ι] [Nonempty ι] (s r : ι → ℝ) (α : ℝ) :

HasDerivAt (fun (w : ℝ) => logSumExp (w • s + r)) (∑ i : ι, softmax (α • s + r) i * s i) α

Derivative of offset log-partition gives weighted expected value: d/dα log(Σ exp(α·sᵢ + rᵢ)) = Σ softmax(α•s + r)ᵢ · sᵢ.

theorem Core.logSumExpOffset_convex {ι : Type u_1} [Fintype ι] [Nonempty ι] (s r : ι → ℝ) :

ConvexOn ℝ Set.univ fun (α : ℝ) => logSumExp (α • s + r)

The offset log-partition function is convex in α.

theorem Core.logConditional_concaveOn {ι : Type u_1} [Fintype ι] [Nonempty ι] (s r : ι → ℝ) (y : ι) :

ConcaveOn ℝ Set.univ fun (α : ℝ) => α * s y + r y - logSumExp (α • s + r)

The log conditional likelihood α ↦ (α · sᵧ + rᵧ) − logSumExp(α•s + r) is concave (affine minus convex).

theorem Core.hasDerivAt_logConditional {ι : Type u_1} [Fintype ι] [Nonempty ι] (s r : ι → ℝ) (y : ι) (α : ℝ) :

HasDerivAt (fun (w : ℝ) => w * s y + r y - logSumExp (w • s + r)) (s y - ∑ i : ι, softmax (α • s + r) i * s i) α

The derivative of log conditional likelihood α ↦ (α·sᵧ + rᵧ) − logSumExp(α•s + r) is the observed feature value minus the expected value: d/dα = sᵧ − Σᵢ softmax(α•s + r)ᵢ · sᵢ.

theorem Core.max_entropy_duality {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι → ℝ) (c α : ℝ) (_hα : 0 < α) (h_constraint : ∑ i : ι, softmax (α • s) i * s i = c) :

Core.entropy✝ Finset.univ (softmax (α • s)) = Real.log (partitionFn (α • s)) - α * c

Strong duality: max entropy = min free energy.

Bayesian optimality #

theorem Core.bayesian_maximizes {ι : Type u_1} [Fintype ι] (w : ι → ℝ) (hw_nonneg : ∀ (i : ι), 0 ≤ w i) (hC_pos : 0 < ∑ i : ι, w i) (L : ι → ℝ) (hL_pos : ∀ (i : ι), 0 < L i) (hL_sum : ∑ i : ι, L i = 1) :

∑ i : ι, w i * Real.log (L i) ≤ ∑ i : ι, w i * Real.log (w i / ∑ j : ι, w j)

Bayesian optimality: the normalized posterior maximizes weighted log-likelihood on the probability simplex. For weights wᵢ ≥ 0 with C = Σ wᵢ > 0 and any distribution L on the simplex, the normalized posterior p*ᵢ = wᵢ / C satisfies Σᵢ wᵢ · log Lᵢ ≤ Σᵢ wᵢ · log p*ᵢ.

Connection to exponential tilting and the cumulant generating function #

Softmax is the finite/counting-measure case of mathlib's exponential-family machinery: the partition function is the moment generating function, log-sum-exp is the cumulant generating function, and softmax is the density of the exponentially-tilted (Esscher / Boltzmann–Gibbs) counting measure.

theorem Core.partitionFn_eq_mgf {ι : Type u_1} [Fintype ι] [MeasurableSpace ι] [MeasurableSingletonClass ι] (s : ι → ℝ) (α : ℝ) :

partitionFn (α • s) = ProbabilityTheory.mgf s MeasureTheory.Measure.count α

The partition function ∑ exp(α·sᵢ) is the moment generating function of the scores under the counting measure.

theorem Core.logSumExp_eq_cgf {ι : Type u_1} [Fintype ι] [MeasurableSpace ι] [MeasurableSingletonClass ι] (s : ι → ℝ) (α : ℝ) :

logSumExp (α • s) = ProbabilityTheory.cgf s MeasureTheory.Measure.count α

Log-sum-exp is the cumulant generating function of the scores under the counting measure.

theorem Core.tilted_count_eq_withDensity_softmax {ι : Type u_1} [Fintype ι] [MeasurableSpace ι] [MeasurableSingletonClass ι] (s : ι → ℝ) (α : ℝ) :

(MeasureTheory.Measure.count.tilted fun (i : ι) => α * s i) = MeasureTheory.Measure.count.withDensity fun (i : ι) => ENNReal.ofReal (softmax (α • s) i)

Softmax is the density of the exponentially-tilted counting measure: tilting the counting measure by the scores α·s yields the measure whose density is softmax (α•s).