Softmax Function: Limit Behavior #
§1. Pointwise limits: α → 0 (uniform), α → ∞ (argmax), α → -∞ (argmin).
§2. BToM observer: softmaxObserver_tendsto_one — an observer watching a
softmax agent concentrates on the uniquely optimal state as α → ∞.
This is the RSA–exhaustivity bridge: L1 at α → ∞ computes exh.
The set of indices achieving the maximum score.
Equations
- Softmax.argmaxSet s = {i : ι | ∀ (j : ι), s j ≤ s i}
The set of indices achieving the minimum score.
Equations
- Softmax.argminSet s = {i : ι | ∀ (j : ι), s i ≤ s j}
Fact 4: As α → 0, softmax converges to uniform distribution.
The ratio of non-max to max probability vanishes as α → ∞.
Helper lemma: at an index attaining the maximum, softmax → 1 as α → ∞.
When there's a unique maximum, softmax concentrates on it as α → ∞.
Log-probability difference grows unboundedly.
As α → -∞, softmax concentrates on the minimum.
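The three pointwise limits above (α → 0 uniform, α → ∞ argmax, α → -∞ argmin) can be checked numerically. Here is a minimal Python sketch; the `softmax` helper and the example scores are illustrative choices, not part of the formalization:

```python
import math

def softmax(scores, alpha):
    """Softmax with rationality parameter alpha: P(i) ∝ exp(alpha * score_i)."""
    m = max(alpha * s for s in scores)          # subtract the max exponent for stability
    exps = [math.exp(alpha * s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

scores = [1.0, 3.0, 2.0]   # unique maximum at index 1, unique minimum at index 0

near_uniform = softmax(scores, 1e-9)   # alpha → 0: approaches the uniform distribution
near_argmax  = softmax(scores, 50.0)   # alpha → ∞: mass concentrates on index 1
near_argmin  = softmax(scores, -50.0)  # alpha → -∞: mass concentrates on index 0
```

Large finite α already puts all but an exponentially small remainder of the mass on the arg-max, which is the quantitative content of the concentration lemmas above.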
The IBR limit: hardmax selector.
Equations
- Softmax.hardmax s i_max h_unique i = if i = i_max then 1 else 0
Softmax converges to hardmax as α → ∞ (when the maximum is unique).
Shannon entropy of a distribution.
Equations
- Softmax.entropy p = -∑ i : ι, p i * Real.log (p i)
Maximum possible entropy (uniform distribution).
Equations
- Softmax.maxEntropy ι = Real.log ↑(Fintype.card ι)
As α → 0, entropy of softmax approaches maximum.
As α → ∞ (with unique max), entropy approaches 0.
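Both entropy limits can be illustrated with the same numerical sketch (the `entropy` helper below mirrors the definition −∑ pᵢ log pᵢ, with the usual 0 log 0 = 0 convention; the scores are again illustrative):

```python
import math

def softmax(scores, alpha):
    m = max(alpha * s for s in scores)
    exps = [math.exp(alpha * s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy H(p) = -Σ p_i log p_i, with 0 log 0 := 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

scores = [1.0, 3.0, 2.0]
max_entropy = math.log(len(scores))       # log |ι|, attained by the uniform distribution

h_low  = entropy(softmax(scores, 1e-9))   # alpha → 0: entropy ≈ log 3 (maximal)
h_high = entropy(softmax(scores, 50.0))   # alpha → ∞ with unique max: entropy ≈ 0
```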
Exponential rate of concentration.
For practical computation: when is softmax close enough to hardmax?
If action j has strictly higher score than i, softmax(i, α) → 0 as α → ∞.
Weaker than tendsto_softmax_infty_unique_max: does not require a unique
maximum, just that i is beaten by some j.
Observer posterior for a softmax agent: BToM inference about the state.
An observer sees a softmax agent choose action i and infers the state: P(s | i, α) = softmax(score(·,s), α)(i) · prior(s) / Σ_{s'} softmax(score(·,s'), α)(i) · prior(s').
In RSA: the pragmatic listener L1 observes the speaker's utterance u and
infers the world w. This is L1(w | u) parameterized by α for limits.
Equations
- Softmax.softmaxObserver score prior α i s = Core.softmax (fun (j : ι) => score j s) α i * prior s / ∑ s' : σ, Core.softmax (fun (j : ι) => score j s') α i * prior s'
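A direct Python transcription of this definition may help; the `softmax_observer` function and the two-state, two-action example are illustrative assumptions that mirror the equation above:

```python
import math

def softmax(scores, alpha):
    m = max(alpha * s for s in scores)
    exps = [math.exp(alpha * s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def softmax_observer(score, prior, alpha, i, actions, states):
    """Posterior P(s | agent chose action i) for a softmax(alpha) agent.
    score[j][s] is the agent's utility for action j in state s."""
    def likelihood(s):
        return softmax([score[j][s] for j in actions], alpha)[actions.index(i)]
    z = sum(likelihood(sp) * prior[sp] for sp in states)
    return {s: likelihood(s) * prior[s] / z for s in states}

# Action 0 is uniquely optimal in state "s0" and suboptimal in state "s1",
# so seeing action 0 at large alpha pushes the posterior onto "s0".
score = {0: {"s0": 2.0, "s1": 0.0}, 1: {"s0": 0.0, "s1": 2.0}}
prior = {"s0": 0.5, "s1": 0.5}
post = softmax_observer(score, prior, alpha=10.0, i=0, actions=[0, 1], states=["s0", "s1"])
```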
BToM inference about a softmax agent concentrates on the optimal state.
An observer watches a softmax agent and infers the hidden state. If action i₀ is the unique best action at state s₀, and is not the best at any other state, the observer's posterior P(s₀ | i₀, α) → 1 as α → ∞.
RSA–exhaustivity bridge: the pragmatic listener (L1) observes a softmax speaker (S1) and does Bayesian inversion. At α → ∞, L1 concentrates on worlds where the heard utterance was the speaker's uniquely optimal choice. This IS the exhaustivity operator: L1 at α → ∞ computes exh(u).
For a simple scale {some, all} with worlds {someNotAll, all}:
- "all" is more informative at the "all" world → S1 says "all" there
- "some" is the only true utterance at "someNotAll" → S1 says "some" there
- L1 hearing "some" → concentrates on "someNotAll" → exh(some) = some ∧ ¬all
softmaxObserver is invariant under state-dependent constant shifts.
Adding c(s) to all action scores at state s doesn't change the
observer's posterior, because softmax is translation-invariant
(softmax_add_const).
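The invariance is easy to check numerically: shifting every score by the same constant leaves the softmax output unchanged, because the shift cancels between numerator and normalizer. A small sketch (constant and scores are arbitrary choices):

```python
import math

def softmax(scores, alpha):
    m = max(alpha * s for s in scores)   # max-subtraction also makes the cancellation exact
    exps = [math.exp(alpha * s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

scores = [1.0, 3.0, 2.0]
c = 7.5  # any constant shift

p       = softmax(scores, 2.0)
shifted = softmax([s + c for s in scores], 2.0)   # identical distribution
```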
Open conjectures #
Four open questions about the algebraic and emergent properties of RSA
(previously stubbed in the deleted Core/Conjectures.lean).
TODO: Fixed-point uniqueness — RSA iteration converges to a
unique fixed point for any rationality α > 0.
TODO: Lexicon-refinement monotonicity — if meaning₂ refines
meaning₁ (i.e., meaning₂ u w → meaning₁ u w), then
L1 meaning₂ u w ≤ L1 meaning₁ u w. Refining lexical meanings can only
strengthen pragmatic inferences, never weaken them.
TODO: Tropical limit — in the α → ∞ limit, soft-max S1
converges to iterated best response (argmax / tropical semiring): for
every ε > 0, (S1 α u w − bestResponse u w)² < ε eventually.
TODO: Neural-symbolic emergence — there exists a coarsening map
from a language model's token-level predictions to world-level
predictions such that the result approximates L1 to arbitrary
ε-precision. Bridges next-token modeling to Gricean reasoning.