Learning Models #
Three families of learning model, built on the Luce choice rule
(RationalAction): linear (ratio-scale) and probability-space learning,
stochastic gradient ascent, and Rescorla-Wagner associative learning.
Main definitions #
LinearLearner,LinearLearner.update,LinearLearner.iterate— the linear (alpha) learning rule on ratio-scale values,vₙ₊₁(a) = α·vₙ(a) + (1-α)·r(a).BetaModel,BetaModel.update— probability-space learning,Pₙ₊₁(a) = (1-β)·Pₙ(a) + β·δ(a, chosen).sgaUpdate,glaUpdate— stochastic gradient ascent and the Gradual Learning Algorithm ([Boe98]).RescorlaWagner,RescorlaWagner.update,RescorlaWagner.iterateConst— error-driven associative learning,ΔV_c = α_c · β · (λ − ΣV).
Main results #
linear_learning_preserves_axiom1— the Luce choice rule is preserved under a positive linear update.LinearLearner.iterate_closed_form,linear_convergence— the geometric-series solution and convergence of the values to the reinforcement.gla_eq_sga,sga_uses_correct_gradient— the GLA is SGA, and SGA follows the MaxEnt log-likelihood gradient.RescorlaWagner.totalStrength_convergence,RescorlaWagner.proportional_partition,RescorlaWagner.overshadowing,RescorlaWagner.blocking— convergence to the outcome maximum, the salience-proportional equilibrium, and the blocking and overshadowing predictions.
Implementation notes #
Axiom 1 preservation is structural rather than a bridge theorem:
LinearLearner.update produces a new RationalAction, and RationalAction.policy
is the Luce choice rule, so IIA holds at every trial by construction.
The Luce linear model and the Rescorla-Wagner total strength obey the same affine
recurrence xₙ₊₁ = r·xₙ + (1-r)·c; their closed forms and convergence proofs share
one geometric-series argument.
[RW72]'s prediction-error term (λ − ΣV) makes learning
competitive across cues, generating blocking (a fully predicted outcome teaches a
co-present cue nothing) and overshadowing (a more salient cue captures more
strength). [Ell06] applies both to second-language acquisition:
high-salience cues (word order, lexical content) block or overshadow low-salience
grammatical morphemes (tense inflection, plural marking) that encode the same
meanings.
The linear learner (alpha model) #
A linear learner for the Luce ratio-scale model ([Luc59]).
The learning rule is a positive linear operator on ratio-scale values:
v_{n+1}(a) = α · v_n(a) + (1-α) · r(a) where α ∈ (0,1) is the retention
rate and r : A → ℝ is the reinforcement function. The key property is that
this is a positive linear transformation — it maps non-negative scores to
non-negative scores — which is exactly what Luce's Axiom 1 preservation
requires.
- learnRate : ℝ
Retention rate: weight on prior score. Must be in (0,1).
- reinforcement : A → ℝ
Reinforcement: the asymptotic target score for each action.
Retention rate is strictly positive.
Retention rate is strictly less than 1.
Reinforcement values are non-negative.
Instances For
The complement 1 - α, which serves as the weight on reinforcement.
Equations
- ll.complementRate = 1 - ll.learnRate
Instances For
Linear update and choice-axiom preservation #
One step of linear learning ([Luc59], the Alpha Model):
v_{n+1}(s, a) = α · v_n(s, a) + (1 - α) · r(a).
The result is a new RationalAction, so the Luce choice rule (IIA) holds at
trial n+1 by construction.
Equations
- ll.update ra = { score := fun (s : S) (a : A) => ll.learnRate * ra.score s a + ll.complementRate * ll.reinforcement a, score_nonneg := ⋯ }
Instances For
Updated scores are non-negative — a direct consequence of the positivity of α, (1-α), the original scores, and the reinforcement values. This is what makes the linear learning model a positive linear operator ([Luc59]).
Axiom 1 preservation ([Luc59]):
If the Luce choice rule P(a) = v(a)/Σv(b) holds at trial n, then after a
positive linear update v' = α·v + (1-α)·r, the updated rule
P'(a) = v'(a)/Σv'(b) is again a Luce choice rule. This holds by construction:
LinearLearner.update produces a RationalAction, whose policy is the
Luce rule.
The beta model #
The β-model ([Luc59]; [BM55]):
An alternative learning model operating directly on probabilities rather than
ratio-scale values:
P_{n+1}(a) = (1-β) · P_n(a) + β · δ(a, chosen_n)
where δ(a, chosen_n) = 1 if a was chosen on trial n, 0 otherwise.
Unlike the linear model, this operates in probability space, so it does not
generally preserve the Luce ratio-scale structure.
- beta : ℝ
Learning rate in (0,1).
β is strictly positive.
β is strictly less than 1.
Instances For
One step of β-model update ([Luc59]):
P'(a) = (1-β) · P(a) + β · δ(a, chosen).
Takes a probability function and the chosen action, returns updated probabilities.
Instances For
β-model update preserves sum-to-one when the input sums to 1.
Iterated linear learning #
n-step iteration of linear learning ([Luc59]): apply the linear
update rule n times starting from an initial RationalAction.
Instances For
After n iterations, scores are still non-negative.
Closed form for iterated linear learning ([Luc59]): after n
iterations with retention rate α and reinforcement r,
vₙ(s, a) = αⁿ · v₀(s, a) + (1 - αⁿ) · r(a), the geometric-series solution to
the linear recurrence.
Asymptotic convergence #
Convergence of linear learning ([Luc59]): under constant
reinforcement, the ratio-scale values converge to the reinforcement values,
vₙ(s, a) → r(a) as n → ∞.
Expected choice sequences #
A record of a learning trial: what was available, what was chosen, what was the reinforcement. Used for analyzing sequences of choices over trials ([Luc59]).
- chosen : A
The action chosen on this trial.
- reinforcement : A → ℝ
The reinforcement received (may differ from the learner's default).
Instances For
An expected choice sequence: the initial agent, a learner, and a history of trials. This structure supports analyzing how choice probabilities evolve over a sequence of reinforced trials ([Luc59]).
- learner : LinearLearner A
The learning model governing updates.
- initial : RationalAction S A
The initial rational agent (trial 0).
- history : List (TrialRecord A)
History of trials (choices and reinforcements).
Instances For
The agent at trial n, after applying n steps of learning.
Instances For
The probability of choosing action a in state s at trial n.
Equations
- ecs.choiceProb n s a = (ecs.agentAt n).policy s a
Instances For
Choice probabilities at any trial sum to 1 (when totalScore is nonzero),
because each trial's agent is a RationalAction.
Stochastic gradient ascent (log-linear models) #
Stochastic Gradient Ascent update for a single weight of a log-linear model.
For MaxEnt phonology, observed_j is the violation count of constraint j on
the observed output, and expected_j is the expected violation count under the
current model distribution.
The batch gradient is E_emp[c_j] − E_r̄[c_j] (see hasDerivAt_logConditional
in RationalAction). The stochastic version replaces both expectations with
single-sample estimates.
Equations
- Core.sgaUpdate r_j η observed_j expected_j = r_j + η * (observed_j - expected_j)
Instances For
The Gradual Learning Algorithm ([Boe98]) update for a single weight: adjust by the signed difference between the observed violation count and the violation count of a hypothesis sampled from the current grammar.
Equations
- Core.glaUpdate r_j η c_j_observed c_j_hypothesis = r_j + η * (↑c_j_observed - ↑c_j_hypothesis)
Instances For
GLA = SGA: the Gradual Learning Algorithm is Stochastic Gradient Ascent with single-sample estimates of both observed and expected feature values. The update rules are identical by definition.
SGA converges to the global maximum of a concave objective.
For MaxEnt log-likelihood, the per-weight objective is concave
(logConditional_concaveOn in RationalAction), so SGA is guaranteed
to converge. This is the key advantage of MaxEnt over Stochastic OT,
where convergence of the GLA is not generally proved.
Rescorla-Wagner associative learning #
The Rescorla-Wagner learning model ([RW72]): on each trial every present cue has its associative strength updated by
ΔV_c = α_c · β · (λ − ΣV)
where α_c is cue salience, β the learning rate (outcome importance), λ the
maximum conditioning supported by the outcome, and ΣV the summed strength of
the cues present on the trial. The prediction-error term (λ − ΣV) is what makes
learning competitive across cues.
- salience : C → ℝ
Cue salience: how noticeable each cue is. Must be in [0, 1].
- learnRate : ℝ
Learning rate (outcome importance). Must be in (0, 1].
- maxCond : ℝ
Maximum conditioning supported by the outcome.
Salience is non-negative.
Salience is at most 1.
Learning rate is strictly positive.
Learning rate is at most 1.
Maximum conditioning is non-negative.
Instances For
Prediction error on a trial: λ − ΣV for cues present.
When positive, the outcome is under-predicted (surprise → learning). When zero, the outcome is fully predicted (no learning). When negative, the outcome is over-predicted (extinction).
Equations
- rw.predictionError present V = rw.maxCond - ∑ c ∈ present, V c
Instances For
One trial of Rescorla-Wagner learning ([RW72]).
For each cue c:
- If c is present:
V'(c) = V(c) + α_c · β · (λ − ΣV) - If c is absent:
V'(c) = V(c)(no change)
Equations
- rw.update present V c = if c ∈ present then V c + rw.salience c * rw.learnRate * rw.predictionError present V else V c
Instances For
Absent cues are not updated.
Blocking theorem ([RW72]; [Ell06]):
when cue A already fully predicts the outcome (V(A) = λ) and is the only cue
with nonzero strength among those present, adding a novel cue B to the compound
produces zero learning for B: V'(B) = V(B).
When the outcome is fully predicted, no present cue learns anything.
Iterated R-W learning over a sequence of trials. Each trial specifies which cues are present.
Equations
Instances For
Constant-trial iteration: the same cue set is presented on every trial.
Equations
- rw.iterateConst S V₀ 0 = V₀
- rw.iterateConst S V₀ n.succ = fun (c : C) => rw.update S (rw.iterateConst S V₀ n) c
Instances For
Rescorla-Wagner equilibrium, convergence, and overshadowing #
Total strength recurrence: the summed associative strength across present cues follows an affine recurrence after each trial,
ΣV' = ΣV + β · (Σ_{c∈S} α_c) · (λ − ΣV)
= (1 − β·A) · ΣV + β·A·λ where A = Σ_{c∈S} α_c
the same shape as the Luce α-model (cf. LinearLearner.iterate_closed_form) with
retention rate 1 − β·A, since the prediction error (λ − ΣV) is shared across
all present cues.
Total strength convergence: under repeated identical trials with the
same cue set S, the total associative strength converges to λ.
Proportional partition ([RW72]): when all cues start
with zero associative strength and the same cue set is presented on every trial,
each cue's strength at every step is proportional to its salience — there exists
Kₙ with Vₙ(c) = α_c · Kₙ for every present cue c. At convergence each cue's
strength is its salience-share of the outcome maximum λ.
Overshadowing ([RW72]; [Ell06]): when two
cues are both present on every trial, the more salient cue captures more
associative strength at every step (and hence at equilibrium). Immediate from
proportional_partition: Vₙ(c) = α_c · K, so Vₙ(A) = α_A · K > α_B · K = Vₙ(B)
whenever K > 0.
ΔP–R-W correspondence ([Ell06]; [CH95a]):
the blocked direction of a blocking experiment. If one cue already captures all
the associative strength and the novel cue starts at zero, the novel cue remains
at zero after compound trials — matching ΔP = 0 for the blocked cue.