Softmax as lse-Renormalised Exponential #
@cite{litvinov-2005}
The bridge between the warped semiring (lse α) and the choice rule
(softmaxDecoder α). The softmax probability admits the clean
"partition-function" form
softmaxDecoder α |>.decode cands score c
= exp(α · (score c - lseFinset α cands score))
for c ∈ cands (the candidate set), since the partition function
Z = Σ exp(α · score c') is exactly exp(α · lseFinset α cands score)
by the cancellation identity exp_alpha_lseFinset.
This identity makes the softmax → argmax limit a direct corollary of
the lse → max limit (lse_tendsto_max):
lseFinset α cands score → max_c (score c)asα → ∞,- so the exponent
α · (score c - lseFinset α cands score)tends to0for the unique maximizer (probability → 1) and to−∞for any loser (probability → 0).
That gives a uniform algebraic explanation for OT-as-MaxEnt-limit: the softmax decoder is "exp of the gap between this candidate's score and the lse-summary score," and the gap closes exactly on the winners when the warped semiring is dequantized.
Partition-function form of the softmax decoder. For c ∈ cands
and α ≠ 0,
softmaxDecoder α |>.decode cands score c
= exp(α · (score c - lseFinset α cands score))
Direct from the cancellation identity exp_alpha_lseFinset: the
denominator Σ exp(α · score c') equals exp(α · lseFinset α …),
so the softmax ratio reduces to exp(α · score c) / exp(α · lse) = exp(α · (score c − lse)).
The softmax-decoder probability of a candidate, written as a single exponential of the gap between its score and the lse summary.
Useful corollary: when score c = max_c' score, the gap closes in
the α → ∞ limit (since lseFinset α cands score → max), so the
exponent → 0 and the probability → 1. When score c < max, the
exponent → α · (score c - max) → −∞, so the probability → 0.
See Limit.lean for the limit theorem and Phase 4 for the formal
argmax-winner characterisation.
The raw ratio form of the partition-function identity.
Sometimes more convenient than _eq_exp_gap when chaining with
inequalities on the partition function (e.g., comparing softmax
probabilities of two different candidates: the lse term cancels).