Rational Action @cite{luce-1959} #
@cite{cover-thomas-2006} @cite{zaslavsky-hu-levy-2020} @cite{adams-messick-1958} The mathematical foundation for all soft-rational agents: RSA speakers/listeners, BToM agents, and decision-theoretic actors.
Architecture #
A RationalAction agent selects actions with probability proportional to a
non-negative score function — the Luce choice rule. This is the
unique choice rule satisfying IIA (independence of irrelevant alternatives):
the relative probability of two actions depends only on their scores.
The key mathematical results characterizing this choice rule are:
- Softmax (§2): The exponential parameterization score = exp(α · utility) gives policy = softmax(utility, α). This is the standard form in RSA.
- Gibbs Variational Principle (§3): Softmax uniquely maximizes H(p) + α · ⟨p, s⟩ on the probability simplex. This is the mathematical foundation for RSA convergence.
- Maximum Entropy (§4): Softmax is the max-entropy distribution subject to an expected-utility constraint. Equivalently, it minimizes free energy (the Boltzmann distribution from statistical mechanics).
- Bayesian Optimality (§5): The Bayesian posterior maximizes expected log-likelihood. This is the listener half of RSA convergence.
A rational action agent selects actions with probability ∝ score(state, action).
This is the Luce choice rule. The score function is unnormalized;
normalization to a proper distribution is derived (see policy).
Key instances:
- RSA L0: score(u, w) = prior(w) · meaning(u, w)
- RSA S1: score(w, u) = rpow(informativity(w, u), α) · exp(-α · cost(u))
- BToM agent: score(state, action) = exp(β · Q(state, action))
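The Luce rule can be sanity-checked numerically. The following Python sketch (outside the Lean development; the prior and meaning values are made up for illustration) normalizes an L0-style score into a policy and confirms scale invariance (Theorem 5):

```python
def policy(scores):
    """Normalize non-negative scores into a Luce-rule policy."""
    total = sum(scores)
    if total == 0:
        return [0.0 for _ in scores]  # convention: zero policy when totalScore = 0
    return [v / total for v in scores]

# An L0-style score: prior(w) * meaning(u, w) for a fixed utterance
# (hypothetical numbers; the utterance is true in worlds 0 and 1 only).
prior = [0.5, 0.3, 0.2]
meaning = [1.0, 1.0, 0.0]
scores = [p * m for p, m in zip(prior, meaning)]
p = policy(scores)
# Scale invariance: multiplying all scores by k > 0 leaves the policy unchanged.
p_scaled = policy([3.0 * v for v in scores])
```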
- score : State → Action → ℝ
Unnormalized score: policy(a|s) ∝ score(s, a).
Scores are non-negative.
Instances For
Total score across all actions in a state (normalization constant).
Equations
- ra.totalScore s = ∑ a : A, ra.score s a
Instances For
Normalized policy: P(a|s) = score(s,a) / Σ_a' score(s,a'). Returns 0 for all actions if totalScore = 0.
Equations
- ra.policy s a = if ra.totalScore s = 0 then 0 else ra.score s a / ra.totalScore s
Instances For
Policy sums to 1 when totalScore is nonzero.
Zero score implies zero policy, regardless of whether totalScore is zero.
Policy respects score equality: equal scores → equal probabilities.
Follows directly from the Luce rule: both sides are score / totalScore
with the same denominator.
When totalScore equals the score of action a, the policy for a is 1.
Used by the compositional proof builder when all other scores are zero,
so totalScore = score a + 0 + ... + 0 = score a, making policy = 1.
Score ordering implies ¬(policy strict ordering). Used by compositional proof builder for ¬(L1 w₁ < L1 w₂) goals.
Strict policy monotonicity: strictly higher score → strictly higher probability.
Used by rsa_decide to eliminate shared denominator computations: when
comparing policy s a₁ < policy s a₂ (same state), it suffices to show
score s a₁ < score s a₂, skipping the expensive totalScore computation
in the proof term.
Iff combining policy_lt_of_score_lt with the contrapositive of
policy_monotone: at the same state, policy strict ordering coincides
with score strict ordering.
Cross-state policy comparison: compares policy values at different states (different denominators). Used for S2 cross-world comparisons where S2(u|w₁) vs S2(u|w₂) have different normalization constants.
The cross-product condition score(s₁,a) * total(s₂) < score(s₂,a) * total(s₁)
is equivalent to score(s₁,a)/total(s₁) < score(s₂,a)/total(s₂) when both
totals are positive.
Cross-state policy comparison with positivity derived from the cross-product.
Like policy_lt_cross but derives the totalScore > 0 conditions from the
cross-product inequality itself: if 0 ≤ score(s₁,a) * total(s₂) < score(s₂,a) * total(s₁),
then score(s₂,a) * total(s₁) > 0, so both score(s₂,a) > 0 and total(s₁) > 0.
And score(s₂,a) ≤ total(s₂), so total(s₂) > 0.
Used by rsa_predict for cross-utterance L1 comparisons where the two sides
have different normalization constants.
Score-sum ordering implies policy-sum ordering when both sides share the same
state (same denominator). Used by rsa_predict for marginal L1 comparisons
where the worlds being summed differ but the utterance and config are shared.
Finset-sum ordering implies policy-sum ordering when both sides share the
same state (same denominator). Like policy_list_sum_lt but for Finset.sum.
Derives totalScore positivity from the score ordering itself, so no extra hypothesis is needed: if 0 ≤ Σ_{F₁} score < Σ_{F₂} score, then some score is positive, so totalScore > 0.
Used by rsa_predict for denominator cancellation in marginal comparisons.
Luce's Choice Axiom #
Luce (1959) showed that the ratio rule P(a|s) = v(a)/Σv(b) is
characterized by the independence of irrelevant alternatives (IIA): the
relative probability of two actions depends only on their scores, not on what
other actions are available.
We formalize:
- The constant ratio rule (Theorem 2): policy(a₁) · score(a₂) = policy(a₂) · score(a₁)
- Choice from subsets (pChoice): restriction of the choice rule to a Finset
- IIA (Axiom 1): ratios in any subset equal score ratios
- The product rule (Theorem 1): P(a, T) = P(a, S) · P(S, T) for S ⊆ T
- Scale invariance (Theorem 5): multiplying all scores by k > 0 preserves policy
- Uniqueness (Theorem 4, forward): proportional scores yield the same policy
Constant Ratio Rule (Theorem 2):
policy(a₁) · score(a₂) = policy(a₂) · score(a₁).
Equivalently, the odds ratio policy(a₁)/policy(a₂) equals score(a₁)/score(a₂).
Choice probability from a subset: pChoice(a, T) = score(a) / Σ_{b∈T} score(b).
Returns 0 if a ∉ T or the subset total is 0.
Equations
Instances For
pChoice is non-negative.
pChoice sums to 1 over the subset when the subset total is nonzero.
IIA core: the ratio of pChoice values in any subset equals the score ratio.
For a₁, a₂ ∈ T with score(a₂) > 0:
pChoice(a₁, T) · score(a₂) = pChoice(a₂, T) · score(a₁) (Axiom 1).
IIA (Axiom 1): P(a, S) = P(a, T) / Σ_{b∈S} P(b, T) for S ⊆ T.
Choice probability from a subset is the conditional probability.
Product rule (Theorem 1):
P(a, T) = P(a, S) · P(S, T) for a ∈ S ⊆ T,
where P(S, T) = Σ_{b∈S} score(b) / Σ_{b∈T} score(b).
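A quick numerical check of the product rule (Python, outside the Lean development; the score function v is hypothetical):

```python
# Product rule (Theorem 1): P(a, T) = P(a, S) * P(S, T) for a ∈ S ⊆ T,
# where P(S, T) = Σ_{b∈S} v(b) / Σ_{b∈T} v(b).
v = {"a": 2.0, "b": 3.0, "c": 5.0}  # made-up positive scores

def p_choice(a, T):
    return v[a] / sum(v[b] for b in T)

S, T = {"a", "b"}, {"a", "b", "c"}
p_ST = sum(v[b] for b in S) / sum(v[b] for b in T)
lhs = p_choice("a", T)           # choose a directly from T
rhs = p_choice("a", S) * p_ST    # first narrow to S, then choose a from S
```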
Scale all scores by a positive constant k.
Instances For
Scale invariance (Theorem 5): scaling scores by k > 0 preserves policy.
Uniqueness (forward direction, Theorem 4):
If scores are proportional (score'(s,a) = k · score(s,a) for some k > 0),
then both agents have the same policy.
Softmax Function #
The softmax function σ(s, α)ᵢ = exp(α · sᵢ) / Σⱼ exp(α · sⱼ) is the
exponential parameterization of the Luce choice rule. Following Franke & Degen
(submitted), we establish Facts 1–8.
The softmax function: softmax(s, α)ᵢ = exp(α · sᵢ) / Σⱼ exp(α · sⱼ).
Equations
- Core.softmax s α i = Real.exp (α * s i) / ∑ j : ι, Real.exp (α * s j)
Instances For
The partition function (normalizing constant) Z = Σⱼ exp(α · sⱼ).
Equations
- Core.partitionFn s α = ∑ j : ι, Real.exp (α * s j)
Instances For
Log-sum-exp: log of partition function.
Equations
- Core.logSumExp s α = Real.log (∑ j : ι, Real.exp (α * s j))
Instances For
The partition function is always positive.
Each softmax probability is positive. (Fact 1, part 1)
Softmax probabilities sum to 1. (Fact 1, part 2)
Softmax is non-negative.
Softmax is at most 1.
Fact 2: Odds are determined by score differences: pᵢ/pⱼ = exp(α(sᵢ - sⱼ)).
Log-odds equal scaled score difference.
Ratio form of Fact 2.
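Fact 2 is easy to confirm numerically. A Python sketch (outside the Lean development, with arbitrary scores):

```python
import math

def softmax(s, alpha):
    z = sum(math.exp(alpha * x) for x in s)
    return [math.exp(alpha * x) / z for x in s]

s = [1.0, 0.25, -0.5]  # arbitrary scores for illustration
alpha = 2.0
p = softmax(s, alpha)
# Fact 2: p_i / p_j = exp(alpha * (s_i - s_j)); the partition function cancels.
odds = p[0] / p[1]
predicted = math.exp(alpha * (s[0] - s[1]))
```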
The logistic (sigmoid) function: S(x) = 1 / (1 + exp(−x)).
Equations
- Core.logistic x = 1 / (1 + Real.exp (-x))
Instances For
The logit function: L(p) = log(p / (1 − p)).
Inverse of logistic on (0, 1).
Equations
- Core.logit p = Real.log (p / (1 - p))
Instances For
Fact 3: For n = 2, softmax reduces to logistic.
Softmax log-odds equals logit of the binary softmax probability
(when there are exactly two alternatives).
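Fact 3 and the logit inverse can be illustrated with a small Python check (outside the Lean development, arbitrary scores):

```python
import math

def logistic(x):
    return 1 / (1 + math.exp(-x))

def softmax2(s0, s1, alpha):
    """Binary softmax: probability of the first alternative."""
    z = math.exp(alpha * s0) + math.exp(alpha * s1)
    return math.exp(alpha * s0) / z

# Fact 3: for n = 2, softmax(s, alpha)_0 = logistic(alpha * (s0 - s1)),
# and the logit of that probability recovers the scaled score difference.
s0, s1, alpha = 0.7, -0.2, 1.5
p0 = softmax2(s0, s1, alpha)
```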
Fact 6: Softmax is translation invariant.
Fact 8: Multiplicative scaling can be absorbed into α.
At α = 0, softmax is uniform.
Log of softmax = score minus log partition function.
Softmax is an exponential family distribution.
Luce choice with rpow scores equals softmax over log scores.
f(i)^α / Σⱼ f(j)^α = softmax(log ∘ f, α)(i) when all f(i) > 0.
This is the general identity connecting belief-based RSA (which uses
rpow) to the softmax framework (which uses exp). Every S1 model with
s1Score = rpow(l0, α) inherits all softmax limit theorems via this
identity: as α → ∞, rpow-based Luce choice concentrates on the
argmax of f, i.e., the most informative utterance.
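The rpow/softmax identity can be verified numerically. A Python sketch (outside the Lean development; f stands in for an L0 listener distribution):

```python
import math

def softmax(s, alpha):
    z = sum(math.exp(alpha * x) for x in s)
    return [math.exp(alpha * x) / z for x in s]

# f(i)^alpha / sum_j f(j)^alpha = softmax(log ∘ f, alpha)_i for positive f,
# since exp(alpha * log f(i)) = f(i)^alpha.
f = [0.5, 0.3, 0.2]  # made-up positive values, e.g. L0 probabilities
alpha = 4.0
luce = [x**alpha / sum(y**alpha for y in f) for x in f]
via_softmax = softmax([math.log(x) for x in f], alpha)
```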
Why Softmax? The Fechnerian Characterization #
The exponential parameterization score = exp(α · utility) is not a design
choice — it is the unique transformation connecting Luce's ratio scale to
a utility (interval) scale (§2.A; @cite{adams-messick-1958}).
Ratio vs interval scales. Luce's Axiom 1 (IIA) yields a ratio scale
v: only ratios v(a)/v(b) are meaningful (Theorem 4). Fechner's
psychophysics requires an interval scale u: only differences
u(a) - u(b) are meaningful. The question: how are v and u related?
Derivation. From P(a,b) = v(a)/(v(a)+v(b)), the odds ratio is
v(a)/v(b) = g(u(a) - u(b)) for some function g. Transitivity of ratios
(v(a)/v(c) = [v(a)/v(b)] · [v(b)/v(c)]) forces Cauchy's multiplicative
functional equation: g(s + t) = g(s) · g(t). The unique monotone increasing
solution is g(s) = exp(k · s) (cauchy_mul_exp), giving:
- v(a) = C · exp(k · u(a)) — the ratio scale IS the exponential of utility
- P(a, b) = 1/(1 + exp(-k · (u(a) - u(b)))) — the logistic function
- For n alternatives: P(a|S) = exp(k · u(a)) / Σ_b exp(k · u(b)) — softmax
The parameter k > 0 is the rationality parameter α in RSA.
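The derivation can be sanity-checked numerically: with v = C · exp(k · u), the Luce binary choice is exactly logistic in the utility difference, and C cancels. A Python sketch (outside the Lean development; C, k, and the utilities are made up):

```python
import math

# With v(a) = C * exp(k * u(a)), the binary Luce choice
# P(a, b) = v(a) / (v(a) + v(b)) is logistic in u(a) - u(b).
C, k = 3.0, 1.7
u = {"a": 0.9, "b": 0.1}
v = {x: C * math.exp(k * u[x]) for x in u}
p_ab = v["a"] / (v["a"] + v["b"])
logistic_form = 1 / (1 + math.exp(-k * (u["a"] - u["b"])))
# The multiplicative constant C cancels in the ratio:
p_ab_rescaled = (10 * v["a"]) / (10 * v["a"] + 10 * v["b"])
```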
Cauchy's multiplicative functional equation (classical):
If g : ℝ → ℝ satisfies g(s + t) = g(s) · g(t) and is strictly
monotone increasing, then g(s) = exp(k · s) for some k > 0.
The proof reduces to the additive Cauchy equation via log: setting
h = log ∘ g, the multiplicative equation becomes h(s+t) = h(s) + h(t).
The key lemma (cauchy_monotone_additive_linear) shows that a strictly
monotone additive function must be linear, by density of ℚ in ℝ:
h agrees with x ↦ k·x on rationals (by induction), and any
deviation on an irrational x would violate monotonicity via a
rational witness between x and h(x)/k.
Fechnerian uniqueness (§2.A; @cite{adams-messick-1958}):
If a ratio scale v and interval scale u represent the same
ordering via v(x)/v(y) = g(u(x) - u(y)) for a strictly monotone
multiplicative g, then v is the exponential of u.
This is WHY fromSoftmax uses exp(α · utility): the exponential
is forced by the requirement that log-odds be linear in utility
differences. It is the unique bridge between Luce's ratio scale
(Chapter 1) and Fechner's interval scale (Chapter 2).
Construct a RationalAction from a utility function via softmax.
The score is exp(α · utility(s, a)), so policy = softmax(utility, α).
The exponential parameterization is forced by the Fechnerian characterization
(luce_fechnerian_exp): it is the unique bridge from Luce's ratio scale
to an additive utility scale.
Equations
- Core.RationalAction.fromSoftmax utility α = { score := fun (s : S) (a : A) => Real.exp (α * utility s a), score_nonneg := ⋯ }
Instances For
The policy of a softmax agent equals the softmax function.
Gibbs Variational Principle #
The softmax distribution uniquely maximizes entropy + expected score on the probability simplex. This is the mathematical foundation for RSA convergence (@cite{zaslavsky-hu-levy-2020}, Proposition 1).
Proof strategy #
The Gibbs VP reduces to KL non-negativity via three identities:
- H(p) + KL(p‖q) = -∑ pᵢ log qᵢ (negMulLog + KL term telescope)
- -∑ pᵢ log qᵢ = -α⟨p,s⟩ + log Z (substitute log qᵢ = α sᵢ - log Z)
- H(q) + α⟨q,s⟩ = log Z (softmax self-information)
Combining: H(p) + α⟨p,s⟩ + KL = log Z = H(q) + α⟨q,s⟩, so KL ≥ 0 ⟹ LHS ≤ RHS.
The KL machinery used here — klFinite, kl_eq_sum_klFun, kl_nonneg —
is inlined as private (ι→ℝ) helpers in this file (mathlib gap; the PMF
form lives at PMF.klDiv / PMF.toReal_klDiv_eq_sum_log_div).
The per-meaning speaker objective: f(s) = Σᵤ [negMulLog(sᵤ) + α · sᵤ · vᵤ].
This is the function that the speaker maximizes for each meaning m, where vᵤ = log L(m|u) is the listener utility of utterance u.
Equations
- Core.speakerObj v α s = ∑ u : ι, ((s u).negMulLog + α * s u * v u)
Instances For
The softmax achieves f(s*) = log Z, where Z is the partition function.
Gibbs Variational Principle: softmax maximizes entropy + expected score.
For any distribution p on ι and scores s : ι → ℝ: H(p) + α·⟨p, s⟩ ≤ H(q) + α·⟨q, s⟩ where q = softmax(s, α) and H(p) = Σ negMulLog(pᵢ).
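The variational inequality can be probed numerically: softmax attains the objective value log Z, and random distributions never exceed it. A Python sketch (outside the Lean development; scores and α are arbitrary):

```python
import math, random

def softmax(s, alpha):
    z = sum(math.exp(alpha * x) for x in s)
    return [math.exp(alpha * x) / z for x in s]

def objective(p, s, alpha):
    # H(p) + alpha * <p, s>, with the convention 0 * log 0 = 0
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h + alpha * sum(pi * si for pi, si in zip(p, s))

s, alpha = [0.3, -0.1, 0.8], 2.0
q = softmax(s, alpha)
best = objective(q, s, alpha)  # should equal log Z

random.seed(0)
for _ in range(1000):
    w = [random.random() + 1e-12 for _ in s]
    p = [x / sum(w) for x in w]
    assert objective(p, s, alpha) <= best + 1e-9  # no random p does better

log_z = math.log(sum(math.exp(alpha * x) for x in s))
```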
Softmax Concentration at High Rationality #
As the rationality parameter α → ∞, softmax concentrates all probability mass on the action with highest utility — i.e., softmax converges to argmax. This connects:
- MaxEnt phonology ↔ OT: a MaxEnt grammar with weights w becomes categorical (OT-like) as temperature → 0 (equivalently α → ∞).
- RSA ↔ neo-Gricean pragmatics: a soft-rational RSA speaker becomes a hard-rational Gricean reasoner in the α → ∞ limit.
Proof sketch #
From softmax_odds, we have σᵢ / σⱼ = exp(α(sᵢ − sⱼ)). When sᵢ > sⱼ,
this ratio → ∞ as α → ∞, so σⱼ / σᵢ → 0. Since Σ σₖ = 1, the maximizer's
probability → 1 by squeezing: 1 - σ_max = Σ_{k≠max} σₖ, and each non-maximal
term → 0 (bounded by exp(-α · gap) where gap = sᵢ - sⱼ > 0).
Softmax → argmax limit: as α → ∞, softmax concentrates on the unique
maximizer. For any i_max with s i_max strictly greater than all other
scores, softmax(s, α)_{i_max} → 1.
This is the formal connection between MaxEnt (soft optimization) and OT (hard optimization): OT is the α → ∞ limit of MaxEnt.
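The concentration behavior shows up immediately in a numerical sketch (Python, outside the Lean development; scores are made up, with a unique maximizer):

```python
import math

def softmax(s, alpha):
    m = max(s)  # shift by the max for numerical stability (Fact 6: translation invariance)
    z = sum(math.exp(alpha * (x - m)) for x in s)
    return [math.exp(alpha * (x - m)) / z for x in s]

s = [0.2, 1.0, 0.5]  # unique maximizer at index 1
# As alpha grows, the maximizer's probability increases toward 1;
# the non-maximal terms decay like exp(-alpha * gap).
probs = [softmax(s, a)[1] for a in (1.0, 10.0, 100.0)]
```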
Entropy + softmax interactions #
The (ι→ℝ)-typed Shannon entropy lives in Core/InformationTheory.lean as
entropy : Finset α → (α → ℝ) → ℝ. The PMF-typed
canonical form is PMF.entropy : PMF α → ℝ; the two agree by definition
on (ofRealWeightFn p).toRealFn = p for normalized p (see
PMF.ofRealWeightFn_toRealFn_eq). This section uses the (ι→ℝ) form because
softmax is a real-arithmetic construction; consumers wanting the PMF form
can wrap via PMF.ofRealWeightFn.
Maximum entropy is achieved by uniform distribution.
Proof: KL(p ‖ uniform) ≥ 0, and KL(p ‖ uniform) = log n - H(p).
Entropy of softmax: H(softmax(s, α)) = log Z - α · 𝔼[s].
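A numerical check of this identity (Python, outside the Lean development; scores and α are arbitrary):

```python
import math

# H(softmax(s, alpha)) = log Z - alpha * E[s], since log q_i = alpha*s_i - log Z.
s, alpha = [0.4, -0.3, 1.1], 1.5
z = sum(math.exp(alpha * x) for x in s)
q = [math.exp(alpha * x) / z for x in s]
entropy = -sum(p * math.log(p) for p in q)
expected_s = sum(p * x for p, x in zip(q, s))
```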
Entropy-regularized objective: G_α(p, s) = ⟨s, p⟩ + (1/α) H(p).
Equations
- Core.entropyRegObjective s α p = ∑ i : ι, p i * s i + 1 / α * Core.entropy✝ Finset.univ p
Instances For
The maximum value of the entropy-regularized objective.
Fact 5: Softmax maximizes the entropy-regularized objective.
Proof: gibbs_variational gives H(p) + α⟨p,s⟩ ≤ H(q) + α⟨q,s⟩;
dividing by α > 0 yields the result.
Softmax is the unique maximizer.
Proof: equality in the objective ⟹ KL(p ‖ softmax) = 0 (via speakerObj_plus_kl),
hence p = softmax (via kl_eq_zero_imp_eq).
Free energy (from statistical mechanics).
Equations
- Core.freeEnergy s α p = -∑ i : ι, p i * s i - 1 / α * Core.entropy✝ Finset.univ p
Instances For
Softmax is the Boltzmann distribution: minimizes free energy.
The log-partition function is convex in α.
Proof: By Hölder's inequality. For 0 < a, b with a + b = 1:
∑ exp(x·sᵢ)^a · exp(y·sᵢ)^b ≤ (∑ exp(x·sᵢ))^a · (∑ exp(y·sᵢ))^b
Since exp(x·sᵢ)^a · exp(y·sᵢ)^b = exp((ax+by)·sᵢ), taking logs gives
logSumExp(s, ax+by) ≤ a·logSumExp(s, x) + b·logSumExp(s, y).
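A midpoint instance of the convexity inequality, checked numerically (Python, outside the Lean development; scores and evaluation points are arbitrary):

```python
import math

def log_sum_exp(s, alpha):
    return math.log(sum(math.exp(alpha * x) for x in s))

# Midpoint convexity: logSumExp(s, (x+y)/2) <= (logSumExp(s, x) + logSumExp(s, y)) / 2.
s = [0.5, -1.0, 2.0]
x, y = -1.0, 3.0
mid = log_sum_exp(s, (x + y) / 2)
avg = (log_sum_exp(s, x) + log_sum_exp(s, y)) / 2
```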
Derivative of log-partition gives expected value:
d/dα log(Σ exp(α sᵢ)) = Σ softmax(s,α)ᵢ · sᵢ.
Proof via chain rule on log ∘ Σ exp(α · sᵢ), then hasDerivAt_finset_sum.
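The derivative identity can be checked against a central finite difference (Python, outside the Lean development; scores, α, and the step size are arbitrary):

```python
import math

def log_sum_exp(s, alpha):
    return math.log(sum(math.exp(alpha * x) for x in s))

def softmax(s, alpha):
    z = sum(math.exp(alpha * x) for x in s)
    return [math.exp(alpha * x) / z for x in s]

# d/dalpha log(sum exp(alpha * s_i)) = sum_i softmax(s, alpha)_i * s_i
s, alpha, h = [0.2, 1.3, -0.7], 0.8, 1e-6
numeric = (log_sum_exp(s, alpha + h) - log_sum_exp(s, alpha - h)) / (2 * h)
analytic = sum(p * x for p, x in zip(softmax(s, alpha), s))
```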
Partition function with per-element offsets: Z(α) = Σⱼ exp(α · sⱼ + rⱼ).
Equations
- Core.partitionFnOffset s r α = ∑ j : ι, Real.exp (α * s j + r j)
Instances For
Log-sum-exp with offsets: log Σ exp(α · sᵢ + rᵢ).
Equations
- Core.logSumExpOffset s r α = Real.log (Core.partitionFnOffset s r α)
Instances For
Softmax with offsets: exp(α · sᵢ + rᵢ) / Z(α).
Equations
- Core.softmaxOffset s r α i = Real.exp (α * s i + r i) / Core.partitionFnOffset s r α
Instances For
Derivative of offset log-partition gives weighted expected value:
d/dα log(Σ exp(α·sᵢ + rᵢ)) = Σ softmaxOffset(s,r,α)ᵢ · sᵢ.
The offset log-partition function is convex in α.
Same Hölder argument as logSumExp_convex: the key factoring
exp((ax+by)·sᵢ + rᵢ) = exp(x·sᵢ + rᵢ)^a · exp(y·sᵢ + rᵢ)^b
holds because a + b = 1, absorbing the offset.
The log conditional likelihood α ↦ (α · sᵧ + rᵧ) − logSumExpOffset(s,r,α)
is concave (affine minus convex).
The derivative of log conditional likelihood α ↦ (α·sᵧ + rᵧ) − logSumExpOffset
is the observed feature value minus the expected value:
d/dα = sᵧ − Σᵢ softmaxOffset(s,r,α)ᵢ · sᵢ.
Strong duality: max entropy = min free energy.
Bayesian optimality: The normalized posterior maximizes weighted log-likelihood on the probability simplex.
For weights wᵢ ≥ 0 with C = Σwᵢ > 0, and any distribution L on the simplex (Lᵢ > 0, Σ Lᵢ = 1), the normalized posterior p*ᵢ = wᵢ/C satisfies:
Σᵢ wᵢ · log(Lᵢ) ≤ Σᵢ wᵢ · log(p*ᵢ)
This is used for listener optimality: with wᵢ = P(m)·S(u|m), the Bayesian listener L*(m|u) = wᵢ/C maximizes Σ_m P(m)·S(u|m)·log L(m|u).
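The listener-optimality claim can be probed numerically: no random distribution beats the normalized posterior on the weighted log-likelihood. A Python sketch (outside the Lean development; the weights are made up):

```python
import math, random

# The normalized posterior p*_i = w_i / C maximizes sum_i w_i * log(L_i)
# over distributions L on the simplex.
w = [0.6, 0.3, 0.1]  # e.g. P(m) * S(u|m) for each meaning m (hypothetical)
C = sum(w)
p_star = [x / C for x in w]
best = sum(wi * math.log(pi) for wi, pi in zip(w, p_star))

random.seed(1)
for _ in range(1000):
    r = [random.random() + 1e-12 for _ in w]
    L = [x / sum(r) for x in r]
    assert sum(wi * math.log(Li) for wi, Li in zip(w, L)) <= best + 1e-9
```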
Uniqueness of the Ratio Scale #
Theorem 4 (proved earlier as policy_eq_of_proportional) shows that
proportional scores yield the same policy. The converse — that identical
policies force proportional scores — completes the uniqueness characterization:
same policy ↔ proportional scores, with proportionality constant
k = totalScore₂/totalScore₁.
Note: This is distinct from the "Independence-of-Unit Condition" in §1.F (pp. 28–33), which concerns state-dependent transformations f satisfying f(kv) = kf(v). That condition addresses how scale values transform across experimental conditions, not the uniqueness of the scale within a condition.
Converse of Theorem 4 (uniqueness of ratio scale):
If two agents with positive total scores have the same policy,
then their scores are proportional with constant k = total₂/total₁.
Full uniqueness characterization (Theorem 4 and its converse): Two agents with positive total scores have the same policy if and only if their scores are proportional.
Alternative Forms of Axiom 1 #
Appendix 1 of Luce (1959) proves three equivalent formulations of the choice axiom:
(a) Ratio form: There exists a positive function v such that
P(x, T) = v(x) / Σ_{y∈T} v(y) for all x ∈ T.
(b) Product rule: P(x, T) = P(x, S) · P(S, T) for x ∈ S ⊆ T,
where P(S, T) = Σ_{y∈S} P(y, T).
(c) Pairwise independence: The odds ratio P(x,{x,y}) · P(y,T) = P(y,{x,y}) · P(x,T) — pairwise ratios are preserved in any superset.
The ratio form (a) is the definition of RationalAction; (a)→(b) is
product_rule and (a)→(c) is pChoice_ratio.
A choice function on finite subsets — the canonical Luce 1959 form (Definition 1, p. 5: "P(a, T) … is a probability distribution on T").
For each non-empty subset T ⊆ A, prob T : A → ℝ is a probability
distribution: non-negative, vanishes outside T, sums to 1 inside.
Specializes to:
- Binary choice via cf.binary x y := cf.prob {x, y} x (see ChoiceFn.binary). Complementarity follows from prob_sum_eq_one (see binary_complement); no separate BinaryChoiceFn structure needed.
- Mathlib PMF when restricted to a fixed support: prob T : A → ℝ is a probability distribution conditioned on choice from T.
The Luce axioms (hasRatioScale, hasProductRule, hasPairwiseIIA)
are stated as properties below, asserting structural facts about which
choice functions admit ratio-scale representations.
- prob : Finset A → A → ℝ — P(a, T): probability of choosing a from set T
- Probabilities are non-negative
- Zero probability outside the choice set
- prob_sum_eq_one (T : Finset A) : T.Nonempty → ∑ a ∈ T, self.prob T a = 1 — probabilities sum to 1 within the choice set (Luce 1959 Def 1)
Instances For
Binary choice view: binary cf x y is the probability of choosing
x over y in the binary forced choice between them.
Defined via cf.prob {x, y} x. Replaces the legacy BinaryChoiceFn
structure in UtilityTheory.lean (see binary_complement for
complementarity).
Instances For
Binary complementarity: cf.binary x y + cf.binary y x = 1
when x ≠ y. Derived from prob_sum_eq_one over {x, y}.
Axiom 1, Form (a): ratio scale representation.
There exists v > 0 such that P(x, T) = v(x) / Σ v(y).
Equations
- cf.hasRatioScale = ∃ (v : A → ℝ), (∀ (a : A), 0 < v a) ∧ ∀ (T : Finset A), ∀ a ∈ T, cf.prob T a = v a / ∑ b ∈ T, v b
Instances For
Axiom 1, Form (b): product rule.
P(x, T) = P(x, S) · Σ_{y∈S} P(y, T) for x ∈ S ⊆ T.
Equations
- cf.hasProductRule = ∀ (S T : Finset A), S ⊆ T → S.Nonempty → ∑ b ∈ T, cf.prob T b = 1 → ∀ a ∈ S, cf.prob T a = cf.prob S a * ∑ b ∈ S, cf.prob T b
Instances For
Axiom 1, Form (c): pairwise independence (IIA).
The odds ratio is preserved in any superset:
P(x,T) · P(y,{x,y}) = P(y,T) · P(x,{x,y}).
Equations
- cf.hasPairwiseIIA = ∀ (T : Finset A) (a b : A), a ∈ T → b ∈ T → cf.prob T a * cf.prob {a, b} b = cf.prob T b * cf.prob {a, b} a
Instances For
(a) → (b): Ratio form implies product rule (Appendix 1).
(a) → (c): Ratio form implies pairwise IIA (Appendix 1).
(c) → (a): Pairwise IIA implies ratio form (Appendix 1).
The ratio scale is constructed by fixing a reference element x₀ and
setting v(x) = P(x, {x, x₀}) / P(x₀, {x, x₀}). Requires strict positivity
for elements in the choice set; normalization (probabilities sum to 1)
is now structural (via prob_sum_eq_one).
Axiom 1 equivalence (Appendix 1):
Ratio form ↔ pairwise IIA (under strict positivity; normalization is
now structural via prob_sum_eq_one).