Documentation

Linglib.Core.Probability.SoftmaxTheory

Softmax: variational, information-theoretic, and limiting characterizations #

The deep theory of softmax (defined in Core.Probability.LogitChoice): why it takes the exponential form, what it optimizes, how it concentrates, and its link to exponential tilting.

Main results #

[UPSTREAM]: pure-math characterizations of softmax; no linguistics. Quality-and-role marker — sits above LogitChoice, below the rational-agent framing.

Why softmax? The Fechnerian characterization #

The exponential parameterization score = exp(α · utility) is not a design choice — it is the unique transformation connecting Luce's ratio scale to a utility (interval) scale ([AM58]).

Ratio vs interval scales. Luce's Axiom 1 (IIA) yields a ratio scale v: only ratios v(a)/v(b) are meaningful. Fechner's psychophysics requires an interval scale u: only differences u(a) - u(b) are meaningful. The question: how are v and u related?

Derivation. From P(a,b) = v(a)/(v(a)+v(b)), the odds ratio is v(a)/v(b) = g(u(a) - u(b)) for some function g. Transitivity of ratios (v(a)/v(c) = [v(a)/v(b)] · [v(b)/v(c)]) forces Cauchy's multiplicative functional equation: g(s + t) = g(s) · g(t). The unique monotone increasing solution is g(s) = exp(k · s) (cauchy_mul_exp), giving:

The parameter k > 0 is the rationality parameter α in RSA.

theorem Core.cauchy_mul_exp (g : ) (hg_mul : ∀ (s t : ), g (s + t) = g s * g t) (hg_mono : StrictMono g) :
∃ (k : ), 0 < k g 0 = 1 ∀ (s : ), g s = Real.exp (k * s)

Cauchy's multiplicative functional equation (classical): if g : ℝ → ℝ satisfies g(s + t) = g(s) · g(t) and is strictly monotone increasing, then g(s) = exp(k · s) for some k > 0.

theorem Core.luce_fechnerian_exp {X : Type u_1} (v u : X) (g : ) (hv_pos : ∀ (x : X), 0 < v x) (h_ratio : ∀ (x y : X), v x / v y = g (u x - u y)) (hg_mul : ∀ (s t : ), g (s + t) = g s * g t) (hg_mono : StrictMono g) :
∃ (k : ), 0 < k ∀ (x₀ x : X), v x = v x₀ * Real.exp (k * (u x - u x₀))

Fechnerian uniqueness ([AM58]): if a ratio scale v and interval scale u represent the same ordering via v(x)/v(y) = g(u(x) - u(y)) for a strictly monotone multiplicative g, then v is the exponential of u.

Gibbs variational principle #

The softmax distribution uniquely maximizes entropy plus expected score on the probability simplex — the mathematical foundation for RSA convergence ([ZHL21]).

The KL machinery used here — klFinite, kl_eq_sum_klFun, kl_nonneg — is inlined as private ι → ℝ helpers, since mathlib provides only the PMF form (PMF.klDiv / PMF.toReal_klDiv_eq_sum_log_div).

noncomputable def Core.speakerObj {ι : Type u_1} [Fintype ι] (v : ι) (α : ) (s : ι) :

The per-meaning speaker objective: f(s) = Σᵤ [negMulLog(sᵤ) + α · sᵤ · vᵤ].

This is the function that the speaker maximizes for each meaning m, where vᵤ = log L(m|u) is the listener utility of utterance u.

Equations
Instances For
    theorem Core.speakerObj_at_softmax {ι : Type u_1} [Fintype ι] [Nonempty ι] (v : ι) (α : ) :
    speakerObj v α (softmax (α v)) = logSumExp (α v)

    The softmax achieves f(s*) = log Z, where Z is the partition function.

    theorem Core.gibbs_variational {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι) (α : ) (p : ι) (hp_nonneg : ∀ (i : ι), 0 p i) (hp_sum : i : ι, p i = 1) :
    i : ι, (p i).negMulLog + α * i : ι, p i * s i i : ι, (softmax (α s) i).negMulLog + α * i : ι, softmax (α s) i * s i

    Gibbs Variational Principle: softmax maximizes entropy + expected score.

    For any distribution p on ι and scores s : ι → ℝ: H(p) + α·⟨p, s⟩ ≤ H(q) + α·⟨q, s⟩ where q = softmax(s, α) and H(p) = Σ negMulLog(pᵢ).

    Softmax concentration at high rationality #

    As the rationality parameter α → ∞, softmax concentrates all probability mass on the action with highest utility — i.e., softmax converges to argmax. This connects:

    theorem Core.softmax_argmax_limit {ι : Type u_1} [Fintype ι] [Nonempty ι] [DecidableEq ι] (s : ι) (i_max : ι) (h_max : ∀ (j : ι), j i_maxs j < s i_max) :
    Filter.Tendsto (fun (α : ) => softmax (α s) i_max) Filter.atTop (nhds 1)

    Softmax → argmax limit: as α → ∞, softmax concentrates on the unique maximizer. For any i_max with s i_max strictly greater than all other scores, softmax(s, α)_{i_max} → 1.

    This is the formal connection between MaxEnt (soft optimization) and OT (hard optimization): OT is the α → ∞ limit of MaxEnt.

    theorem Core.softmax_nonmax_limit {ι : Type u_1} [Fintype ι] (s : ι) (i_max : ι) (h_max : ∀ (j : ι), j i_maxs j < s i_max) (j : ι) (hj : j i_max) :
    Filter.Tendsto (fun (α : ) => softmax (α s) j) Filter.atTop (nhds 0)

    Complement of the limit: non-maximal actions get probability → 0.

    Shannon entropy and maximum entropy #

    This section uses the private ι → ℝ Shannon entropy (entropy, above), since softmax is a real-arithmetic construction; consumers wanting the PMF-typed form can convert via Core.Probability.Entropy.

    theorem Core.entropy_le_log_card {ι : Type u_1} [Fintype ι] [Nonempty ι] (p : ι) (hp_nonneg : ∀ (i : ι), 0 p i) (hp_sum : i : ι, p i = 1) :
    Core.entropy✝ Finset.univ p Real.log (Fintype.card ι)

    Maximum entropy is achieved by the uniform distribution.

    theorem Core.entropy_softmax {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι) (α : ) :
    Core.entropy✝ Finset.univ (softmax (α s)) = Real.log (partitionFn (α s)) - α * i : ι, softmax (α s) i * s i

    Entropy of softmax: H(softmax(s, α)) = log Z - α · 𝔼[s].

    noncomputable def Core.entropyRegObjective {ι : Type u_1} [Fintype ι] (s : ι) (α : ) (p : ι) :

    Entropy-regularized objective: G_α(p, s) = ⟨s, p⟩ + (1/α) H(p).

    Equations
    Instances For
      theorem Core.entropyRegObjective_softmax {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι) (α : ) ( : 0 < α) :
      entropyRegObjective s α (softmax (α s)) = 1 / α * Real.log (partitionFn (α s))

      The maximum value of the entropy-regularized objective.

      theorem Core.softmax_maximizes_entropyReg {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι) (α : ) ( : 0 < α) (p : ι) (hp_nonneg : ∀ (i : ι), 0 p i) (hp_sum : i : ι, p i = 1) :

      Softmax maximizes the entropy-regularized objective.

      theorem Core.softmax_unique_maximizer {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι) (α : ) ( : 0 < α) (p : ι) (hp_nonneg : ∀ (i : ι), 0 p i) (hp_sum : i : ι, p i = 1) (h_max : entropyRegObjective s α p = entropyRegObjective s α (softmax (α s))) :
      p = softmax (α s)

      Softmax is the unique maximizer of the entropy-regularized objective.

      noncomputable def Core.freeEnergy {ι : Type u_1} [Fintype ι] (s : ι) (α : ) (p : ι) :

      Free energy (from statistical mechanics).

      Equations
      Instances For
        theorem Core.softmax_minimizes_freeEnergy {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι) (α : ) ( : 0 < α) (p : ι) (hp_nonneg : ∀ (i : ι), 0 p i) (hp_sum : i : ι, p i = 1) :
        freeEnergy s α (softmax (α s)) freeEnergy s α p

        Softmax is the Boltzmann distribution: minimizes free energy.

        theorem Core.logSumExp_convex {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι) :
        ConvexOn Set.univ fun (α : ) => logSumExp (α s)

        The log-partition function is convex in α.

        theorem Core.deriv_logSumExp {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι) (α : ) :
        deriv (fun (α : ) => logSumExp (α s)) α = i : ι, softmax (α s) i * s i

        Derivative of log-partition gives expected value: d/dα log(Σ exp(α sᵢ)) = Σ softmax(s,α)ᵢ · sᵢ.

        Per-element offsets #

        These instantiate the plain logSumExp/softmax at the shifted argument α • s + r, i.e. logSumExp (α • s + r) = log Σ exp(α·sᵢ + rᵢ). This form appears when differentiating the log-partition with respect to a single weight wⱼ in a multi-constraint grammar, where sᵢ is constraint j's contribution to candidate i and rᵢ the contribution of all other constraints (constant in wⱼ).

        theorem Core.hasDerivAt_logSumExpOffset {ι : Type u_1} [Fintype ι] [Nonempty ι] (s r : ι) (α : ) :
        HasDerivAt (fun (w : ) => logSumExp (w s + r)) (∑ i : ι, softmax (α s + r) i * s i) α

        Derivative of offset log-partition gives weighted expected value: d/dα log(Σ exp(α·sᵢ + rᵢ)) = Σ softmax(α•s + r)ᵢ · sᵢ.

        theorem Core.logSumExpOffset_convex {ι : Type u_1} [Fintype ι] [Nonempty ι] (s r : ι) :
        ConvexOn Set.univ fun (α : ) => logSumExp (α s + r)

        The offset log-partition function is convex in α.

        theorem Core.logConditional_concaveOn {ι : Type u_1} [Fintype ι] [Nonempty ι] (s r : ι) (y : ι) :
        ConcaveOn Set.univ fun (α : ) => α * s y + r y - logSumExp (α s + r)

        The log conditional likelihood α ↦ (α · sᵧ + rᵧ) − logSumExp(α•s + r) is concave (affine minus convex).

        theorem Core.hasDerivAt_logConditional {ι : Type u_1} [Fintype ι] [Nonempty ι] (s r : ι) (y : ι) (α : ) :
        HasDerivAt (fun (w : ) => w * s y + r y - logSumExp (w s + r)) (s y - i : ι, softmax (α s + r) i * s i) α

        The derivative of log conditional likelihood α ↦ (α·sᵧ + rᵧ) − logSumExp(α•s + r) is the observed feature value minus the expected value: d/dα = sᵧ − Σᵢ softmax(α•s + r)ᵢ · sᵢ.

        theorem Core.max_entropy_duality {ι : Type u_1} [Fintype ι] [Nonempty ι] (s : ι) (c α : ) (_hα : 0 < α) (h_constraint : i : ι, softmax (α s) i * s i = c) :
        Core.entropy✝ Finset.univ (softmax (α s)) = Real.log (partitionFn (α s)) - α * c

        Strong duality: max entropy = min free energy.

        Bayesian optimality #

        theorem Core.bayesian_maximizes {ι : Type u_1} [Fintype ι] (w : ι) (hw_nonneg : ∀ (i : ι), 0 w i) (hC_pos : 0 < i : ι, w i) (L : ι) (hL_pos : ∀ (i : ι), 0 < L i) (hL_sum : i : ι, L i = 1) :
        i : ι, w i * Real.log (L i) i : ι, w i * Real.log (w i / j : ι, w j)

        Bayesian optimality: the normalized posterior maximizes weighted log-likelihood on the probability simplex. For weights wᵢ ≥ 0 with C = Σ wᵢ > 0 and any distribution L on the simplex, the normalized posterior p*ᵢ = wᵢ / C satisfies Σᵢ wᵢ · log Lᵢ ≤ Σᵢ wᵢ · log p*ᵢ.

        Connection to exponential tilting and the cumulant generating function #

        Softmax is the finite/counting-measure case of mathlib's exponential-family machinery: the partition function is the moment generating function, log-sum-exp is the cumulant generating function, and softmax is the density of the exponentially-tilted (Esscher / Boltzmann–Gibbs) counting measure.

        theorem Core.partitionFn_eq_mgf {ι : Type u_1} [Fintype ι] [MeasurableSpace ι] [MeasurableSingletonClass ι] (s : ι) (α : ) :
        partitionFn (α s) = ProbabilityTheory.mgf s MeasureTheory.Measure.count α

        The partition function ∑ exp(α·sᵢ) is the moment generating function of the scores under the counting measure.

        theorem Core.logSumExp_eq_cgf {ι : Type u_1} [Fintype ι] [MeasurableSpace ι] [MeasurableSingletonClass ι] (s : ι) (α : ) :
        logSumExp (α s) = ProbabilityTheory.cgf s MeasureTheory.Measure.count α

        Log-sum-exp is the cumulant generating function of the scores under the counting measure.

        theorem Core.tilted_count_eq_withDensity_softmax {ι : Type u_1} [Fintype ι] [MeasurableSpace ι] [MeasurableSingletonClass ι] (s : ι) (α : ) :
        (MeasureTheory.Measure.count.tilted fun (i : ι) => α * s i) = MeasureTheory.Measure.count.withDensity fun (i : ι) => ENNReal.ofReal (softmax (α s) i)

        Softmax is the density of the exponentially-tilted counting measure: tilting the counting measure by the scores α·s yields the measure whose density is softmax (α•s).