Documentation

Linglib.Core.Probability.Choice.Learning

Learning Models #

[Luc59] [BM55] [RW72]

Three families of learning model, built on the Luce choice rule (RationalAction): linear (ratio-scale) and probability-space learning, stochastic gradient ascent, and Rescorla-Wagner associative learning.

Main definitions #

LinearLearner, LinearLearner.update, LinearLearner.iterate — the linear (alpha) learning rule on ratio-scale values, vₙ₊₁(a) = α·vₙ(a) + (1-α)·r(a).
BetaModel, BetaModel.update — probability-space learning, Pₙ₊₁(a) = (1-β)·Pₙ(a) + β·δ(a, chosen).
sgaUpdate, glaUpdate — stochastic gradient ascent and the Gradual Learning Algorithm ([Boe98]).
RescorlaWagner, RescorlaWagner.update, RescorlaWagner.iterateConst — error-driven associative learning, ΔV_c = α_c · β · (λ − ΣV).

Main results #

linear_learning_preserves_axiom1 — the Luce choice rule is preserved under a positive linear update.
LinearLearner.iterate_closed_form, linear_convergence — the geometric-series solution and convergence of the values to the reinforcement.
gla_eq_sga, sga_uses_correct_gradient — the GLA is SGA, and SGA follows the MaxEnt log-likelihood gradient.
RescorlaWagner.totalStrength_convergence, RescorlaWagner.proportional_partition, RescorlaWagner.overshadowing, RescorlaWagner.blocking — convergence to the outcome maximum, the salience-proportional equilibrium, and the blocking and overshadowing predictions.

Implementation notes #

Axiom 1 preservation is structural rather than a bridge theorem: LinearLearner.update produces a new RationalAction, and RationalAction.policy is the Luce choice rule, so IIA holds at every trial by construction.

The Luce linear model and the Rescorla-Wagner total strength obey the same affine recurrence xₙ₊₁ = r·xₙ + (1-r)·c; their closed forms and convergence proofs share one geometric-series argument.

[RW72]'s prediction-error term (λ − ΣV) makes learning competitive across cues, generating blocking (a fully predicted outcome teaches a co-present cue nothing) and overshadowing (a more salient cue captures more strength). [Ell06] applies both to second-language acquisition: high-salience cues (word order, lexical content) block or overshadow low-salience grammatical morphemes (tense inflection, plural marking) that encode the same meanings.

The linear learner (alpha model) #

structure Core.LinearLearner (A : Type u_1) :

Type u_1

A linear learner for the Luce ratio-scale model ([Luc59]).

The learning rule is a positive linear operator on ratio-scale values: v_{n+1}(a) = α · v_n(a) + (1-α) · r(a) where α ∈ (0,1) is the retention rate and r : A → ℝ is the reinforcement function. The key property is that this is a positive linear transformation — it maps non-negative scores to non-negative scores — which is exactly what Luce's Axiom 1 preservation requires.

learnRate : ℝ
Retention rate: weight on prior score. Must be in (0,1).
reinforcement : A → ℝ
Reinforcement: the asymptotic target score for each action.
learnRate_pos : 0 < self.learnRate
Retention rate is strictly positive.
learnRate_lt_one : self.learnRate < 1
Retention rate is strictly less than 1.
reinforcement_nonneg (a : A) : 0 ≤ self.reinforcement a
Reinforcement values are non-negative.

Instances For

def Core.LinearLearner.complementRate {A : Type u_2} (ll : LinearLearner A) :

ℝ

The complement 1 - α, which serves as the weight on reinforcement.

Equations

ll.complementRate = 1 - ll.learnRate

Instances For

theorem Core.LinearLearner.complementRate_pos {A : Type u_2} (ll : LinearLearner A) :

0 < ll.complementRate

theorem Core.LinearLearner.complementRate_lt_one {A : Type u_2} (ll : LinearLearner A) :

ll.complementRate < 1

Linear update and choice-axiom preservation #

def Core.LinearLearner.update {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) :

RationalAction S A

One step of linear learning ([Luc59], the Alpha Model): v_{n+1}(s, a) = α · v_n(s, a) + (1 - α) · r(a).

The result is a new RationalAction, so the Luce choice rule (IIA) holds at trial n+1 by construction.

Equations

ll.update ra = { score := fun (s : S) (a : A) => ll.learnRate * ra.score s a + ll.complementRate * ll.reinforcement a, score_nonneg := ⋯ }

Instances For

theorem Core.linear_update_nonneg {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (s : S) (a : A) :

0 ≤ (ll.update ra).score s a

Updated scores are non-negative — a direct consequence of the positivity of α, (1-α), the original scores, and the reinforcement values. This is what makes the linear learning model a positive linear operator ([Luc59]).

theorem Core.linear_learning_preserves_axiom1 {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (s : S) (a : A) :

(ll.update ra).policy s a = if (ll.update ra).totalScore s = 0 then 0 else (ll.update ra).score s a / (ll.update ra).totalScore s

Axiom 1 preservation ([Luc59]):

If the Luce choice rule P(a) = v(a)/Σv(b) holds at trial n, then after a positive linear update v' = α·v + (1-α)·r, the updated rule P'(a) = v'(a)/Σv'(b) is again a Luce choice rule. This holds by construction: LinearLearner.update produces a RationalAction, whose policy is the Luce rule.

The beta model #

structure Core.BetaModel (A : Type u_3) :

The β-model ([Luc59]; [BM55]):

An alternative learning model operating directly on probabilities rather than ratio-scale values: P_{n+1}(a) = (1-β) · P_n(a) + β · δ(a, chosen_n)

where δ(a, chosen_n) = 1 if a was chosen on trial n, 0 otherwise. Unlike the linear model, this operates in probability space, so it does not generally preserve the Luce ratio-scale structure.

beta : ℝ
Learning rate in (0,1).
beta_pos : 0 < self.beta
β is strictly positive.
beta_lt_one : self.beta < 1
β is strictly less than 1.

Instances For

def Core.BetaModel.update {A : Type u_2} [DecidableEq A] (bm : BetaModel A) (P : A → ℝ) (chosen : A) :

A → ℝ

One step of β-model update ([Luc59]): P'(a) = (1-β) · P(a) + β · δ(a, chosen).

Takes a probability function and the chosen action, returns updated probabilities.

Equations

bm.update P chosen a = (1 - bm.beta) * P a + bm.beta * if a = chosen then 1 else 0

Instances For

theorem Core.BetaModel.update_nonneg {A : Type u_2} [DecidableEq A] (bm : BetaModel A) (P : A → ℝ) (chosen : A) (hP : ∀ (a : A), 0 ≤ P a) (a : A) :

0 ≤ bm.update P chosen a

β-model update preserves non-negativity of probabilities.

theorem Core.BetaModel.update_sum_one {A : Type u_2} [Fintype A] [DecidableEq A] (bm : BetaModel A) (P : A → ℝ) (chosen : A) (h_chosen : chosen ∈ Finset.univ) (h_sum : ∑ a : A, P a = 1) :

∑ a : A, bm.update P chosen a = 1

β-model update preserves sum-to-one when the input sums to 1.

Iterated linear learning #

def Core.LinearLearner.iterate {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) :

ℕ → RationalAction S A

n-step iteration of linear learning ([Luc59]): apply the linear update rule n times starting from an initial RationalAction.

Equations

ll.iterate ra 0 = ra
ll.iterate ra n.succ = ll.update (ll.iterate ra n)

Instances For

theorem Core.LinearLearner.iterate_nonneg {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (n : ℕ) (s : S) (a : A) :

0 ≤ (ll.iterate ra n).score s a

After n iterations, scores are still non-negative.

theorem Core.LinearLearner.iterate_closed_form {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (n : ℕ) (s : S) (a : A) :

(ll.iterate ra n).score s a = ll.learnRate ^ n * ra.score s a + (1 - ll.learnRate ^ n) * ll.reinforcement a

Closed form for iterated linear learning ([Luc59]): after n iterations with retention rate α and reinforcement r, vₙ(s, a) = αⁿ · v₀(s, a) + (1 - αⁿ) · r(a), the geometric-series solution to the linear recurrence.

Asymptotic convergence #

theorem Core.linear_convergence {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (s : S) (a : A) :

Filter.Tendsto (fun (n : ℕ) => (ll.iterate ra n).score s a) Filter.atTop (nhds (ll.reinforcement a))

Convergence of linear learning ([Luc59]): under constant reinforcement, the ratio-scale values converge to the reinforcement values, vₙ(s, a) → r(a) as n → ∞.

Expected choice sequences #

structure Core.TrialRecord (A : Type u_3) :

Type u_3

A record of a learning trial: what was available, what was chosen, what was the reinforcement. Used for analyzing sequences of choices over trials ([Luc59]).

chosen : A
The action chosen on this trial.
reinforcement : A → ℝ
The reinforcement received (may differ from the learner's default).

Instances For

structure Core.ExpectedChoiceSequence (S : Type u_3) (A : Type u_4) [Fintype A] :

Type (max u_3 u_4)

An expected choice sequence: the initial agent, a learner, and a history of trials. This structure supports analyzing how choice probabilities evolve over a sequence of reinforced trials ([Luc59]).

learner : LinearLearner A
The learning model governing updates.
initial : RationalAction S A
The initial rational agent (trial 0).
history : List (TrialRecord A)
History of trials (choices and reinforcements).

Instances For

def Core.ExpectedChoiceSequence.agentAt {S : Type u_1} {A : Type u_2} [Fintype A] (ecs : ExpectedChoiceSequence S A) (n : ℕ) :

RationalAction S A

The agent at trial n, after applying n steps of learning.

Equations

ecs.agentAt n = ecs.learner.iterate ecs.initial n

Instances For

noncomputable def Core.ExpectedChoiceSequence.choiceProb {S : Type u_1} {A : Type u_2} [Fintype A] (ecs : ExpectedChoiceSequence S A) (n : ℕ) (s : S) (a : A) :

ℝ

The probability of choosing action a in state s at trial n.

Equations

ecs.choiceProb n s a = (ecs.agentAt n).policy s a

Instances For

theorem Core.ExpectedChoiceSequence.choiceProb_sum_one {S : Type u_1} {A : Type u_2} [Fintype A] (ecs : ExpectedChoiceSequence S A) (n : ℕ) (s : S) (h : (ecs.agentAt n).totalScore s ≠ 0) :

∑ a : A, ecs.choiceProb n s a = 1

Choice probabilities at any trial sum to 1 (when totalScore is nonzero), because each trial's agent is a RationalAction.

Stochastic gradient ascent (log-linear models) #

def Core.sgaUpdate (r_j η observed_j expected_j : ℝ) :

ℝ

Stochastic Gradient Ascent update for a single weight of a log-linear model.

For MaxEnt phonology, observed_j is the violation count of constraint j on the observed output, and expected_j is the expected violation count under the current model distribution.

The batch gradient is E_emp[c_j] − E_r̄[c_j] (see hasDerivAt_logConditional in RationalAction). The stochastic version replaces both expectations with single-sample estimates.

Equations

Core.sgaUpdate r_j η observed_j expected_j = r_j + η * (observed_j - expected_j)

Instances For

def Core.glaUpdate (r_j η : ℝ) (c_j_observed c_j_hypothesis : ℕ) :

ℝ

The Gradual Learning Algorithm ([Boe98]) update for a single weight: adjust by the signed difference between the observed violation count and the violation count of a hypothesis sampled from the current grammar.

Equations

Core.glaUpdate r_j η c_j_observed c_j_hypothesis = r_j + η * (↑c_j_observed - ↑c_j_hypothesis)

Instances For

theorem Core.gla_eq_sga (r_j η : ℝ) (obs hyp : ℕ) :

glaUpdate r_j η obs hyp = sgaUpdate r_j η ↑obs ↑hyp

GLA = SGA: the Gradual Learning Algorithm is Stochastic Gradient Ascent with single-sample estimates of both observed and expected feature values. The update rules are identical by definition.

theorem Core.sga_uses_correct_gradient {ι : Type u_1} [Fintype ι] [Nonempty ι] (s r : ι → ℝ) (y : ι) (wⱼ : ℝ) :

HasDerivAt (fun (w : ℝ) => w * s y + r y - logSumExp (w • s + r)) (s y - ∑ i : ι, softmax (wⱼ • s + r) i * s i) wⱼ

SGA converges to the global maximum of a concave objective.

For MaxEnt log-likelihood, the per-weight objective is concave (logConditional_concaveOn in RationalAction), so SGA is guaranteed to converge. This is the key advantage of MaxEnt over Stochastic OT, where convergence of the GLA is not generally proved.

Rescorla-Wagner associative learning #

structure Core.RescorlaWagner (C : Type u_1) :

Type u_1

The Rescorla-Wagner learning model ([RW72]): on each trial every present cue has its associative strength updated by

ΔV_c = α_c · β · (λ − ΣV)

where α_c is cue salience, β the learning rate (outcome importance), λ the maximum conditioning supported by the outcome, and ΣV the summed strength of the cues present on the trial. The prediction-error term (λ − ΣV) is what makes learning competitive across cues.

salience : C → ℝ
Cue salience: how noticeable each cue is. Must be in [0, 1].
learnRate : ℝ
Learning rate (outcome importance). Must be in (0, 1].
maxCond : ℝ
Maximum conditioning supported by the outcome.
salience_nonneg (c : C) : 0 ≤ self.salience c
Salience is non-negative.
salience_le_one (c : C) : self.salience c ≤ 1
Salience is at most 1.
learnRate_pos : 0 < self.learnRate
Learning rate is strictly positive.
learnRate_le_one : self.learnRate ≤ 1
Learning rate is at most 1.
maxCond_nonneg : 0 ≤ self.maxCond
Maximum conditioning is non-negative.

Instances For

def Core.RescorlaWagner.predictionError {C : Type u_1} (rw : RescorlaWagner C) (present : Finset C) (V : C → ℝ) :

ℝ

Prediction error on a trial: λ − ΣV for cues present.

When positive, the outcome is under-predicted (surprise → learning). When zero, the outcome is fully predicted (no learning). When negative, the outcome is over-predicted (extinction).

Equations

rw.predictionError present V = rw.maxCond - ∑ c ∈ present, V c

Instances For

def Core.RescorlaWagner.update {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (present : Finset C) (V : C → ℝ) (c : C) :

ℝ

One trial of Rescorla-Wagner learning ([RW72]).

For each cue c:

If c is present: V'(c) = V(c) + α_c · β · (λ − ΣV)
If c is absent: V'(c) = V(c) (no change)

Equations

rw.update present V c = if c ∈ present then V c + rw.salience c * rw.learnRate * rw.predictionError present V else V c

Instances For

theorem Core.RescorlaWagner.update_absent {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (present : Finset C) (V : C → ℝ) (c : C) (hc : c ∉ present) :

rw.update present V c = V c

Absent cues are not updated.

theorem Core.RescorlaWagner.blocking {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (present : Finset C) (V : C → ℝ) (B : C) (hB : B ∈ present) (h_total : ∑ c ∈ present, V c = rw.maxCond) :

rw.update present V B = V B

Blocking theorem ([RW72]; [Ell06]): when cue A already fully predicts the outcome (V(A) = λ) and is the only cue with nonzero strength among those present, adding a novel cue B to the compound produces zero learning for B: V'(B) = V(B).

theorem Core.RescorlaWagner.no_learning_at_equilibrium {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (present : Finset C) (V : C → ℝ) (h_total : ∑ c ∈ present, V c = rw.maxCond) (c : C) :

rw.update present V c = V c

When the outcome is fully predicted, no present cue learns anything.

def Core.RescorlaWagner.iterate {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (trials : List (Finset C)) (V₀ : C → ℝ) :

C → ℝ

Iterated R-W learning over a sequence of trials. Each trial specifies which cues are present.

Equations

rw.iterate trials V₀ = List.foldl (fun (V : C → ℝ) (present : Finset C) (c : C) => rw.update present V c) V₀ trials

Instances For

def Core.RescorlaWagner.iterateConst {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (V₀ : C → ℝ) :

ℕ → C → ℝ

Constant-trial iteration: the same cue set is presented on every trial.

Equations

rw.iterateConst S V₀ 0 = V₀
rw.iterateConst S V₀ n.succ = fun (c : C) => rw.update S (rw.iterateConst S V₀ n) c

Instances For

Rescorla-Wagner equilibrium, convergence, and overshadowing #

theorem Core.RescorlaWagner.totalStrength_recurrence {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (V : C → ℝ) :

∑ c ∈ S, rw.update S V c = ∑ c ∈ S, V c + rw.learnRate * rw.predictionError S V * ∑ c ∈ S, rw.salience c

Total strength recurrence: the summed associative strength across present cues follows an affine recurrence after each trial,

ΣV' = ΣV + β · (Σ_{c∈S} α_c) · (λ − ΣV)
    = (1 − β·A) · ΣV + β·A·λ       where A = Σ_{c∈S} α_c

the same shape as the Luce α-model (cf. LinearLearner.iterate_closed_form) with retention rate 1 − β·A, since the prediction error (λ − ΣV) is shared across all present cues.

theorem Core.RescorlaWagner.totalStrength_convergence {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (V₀ : C → ℝ) (h_pos : 0 < rw.learnRate * ∑ c ∈ S, rw.salience c) (h_stable : rw.learnRate * ∑ c ∈ S, rw.salience c < 1) :

Filter.Tendsto (fun (n : ℕ) => ∑ c ∈ S, rw.iterateConst S V₀ n c) Filter.atTop (nhds rw.maxCond)

Total strength convergence: under repeated identical trials with the same cue set S, the total associative strength converges to λ.

theorem Core.RescorlaWagner.proportional_partition {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (n : ℕ) :

∃ (K : ℝ), ∀ c ∈ S, rw.iterateConst S (fun (x : C) => 0) n c = rw.salience c * K

Proportional partition ([RW72]): when all cues start with zero associative strength and the same cue set is presented on every trial, each cue's strength at every step is proportional to its salience — there exists Kₙ with Vₙ(c) = α_c · Kₙ for every present cue c. At convergence each cue's strength is its salience-share of the outcome maximum λ.

theorem Core.RescorlaWagner.overshadowing {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (A B : C) (n : ℕ) (hA : A ∈ S) (hB : B ∈ S) (_hne : A ≠ B) (h_salience : rw.salience B < rw.salience A) (h_pos : 0 < ∑ c ∈ S, rw.iterateConst S (fun (x : C) => 0) n c) :

rw.iterateConst S (fun (x : C) => 0) n B < rw.iterateConst S (fun (x : C) => 0) n A

Overshadowing ([RW72]; [Ell06]): when two cues are both present on every trial, the more salient cue captures more associative strength at every step (and hence at equilibrium). Immediate from proportional_partition: Vₙ(c) = α_c · K, so Vₙ(A) = α_A · K > α_B · K = Vₙ(B) whenever K > 0.

theorem Core.RescorlaWagner.blocked_cue_zero {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (V : C → ℝ) (C' : C) (hC : C' ∈ S) (hV : V C' = 0) (h_total : ∑ c ∈ S, V c = rw.maxCond) :

rw.update S V C' = 0

ΔP–R-W correspondence ([Ell06]; [CH95a]): the blocked direction of a blocking experiment. If one cue already captures all the associative strength and the novel cue starts at zero, the novel cue remains at zero after compound trials — matching ΔP = 0 for the blocked cue.