Documentation

Linglib.Core.Probability.Choice.Learning

Learning Models #

[Luc59] [BM55] [RW72]

Three families of learning model, built on the Luce choice rule (RationalAction): linear (ratio-scale) and probability-space learning, stochastic gradient ascent, and Rescorla-Wagner associative learning.

Main definitions #

Main results #

Implementation notes #

Axiom 1 preservation is structural rather than a bridge theorem: LinearLearner.update produces a new RationalAction, and RationalAction.policy is the Luce choice rule, so IIA holds at every trial by construction.

The Luce linear model and the Rescorla-Wagner total strength obey the same affine recurrence xₙ₊₁ = r·xₙ + (1-r)·c; their closed forms and convergence proofs share one geometric-series argument.

[RW72]'s prediction-error term (λ − ΣV) makes learning competitive across cues, generating blocking (a fully predicted outcome teaches a co-present cue nothing) and overshadowing (a more salient cue captures more strength). [Ell06] applies both to second-language acquisition: high-salience cues (word order, lexical content) block or overshadow low-salience grammatical morphemes (tense inflection, plural marking) that encode the same meanings.

The linear learner (alpha model) #

structure Core.LinearLearner (A : Type u_1) :
Type u_1

A linear learner for the Luce ratio-scale model ([Luc59]).

The learning rule is a positive linear operator on ratio-scale values: v_{n+1}(a) = α · v_n(a) + (1-α) · r(a) where α ∈ (0,1) is the retention rate and r : A → ℝ is the reinforcement function. The key property is that this is a positive linear transformation — it maps non-negative scores to non-negative scores — which is exactly what Luce's Axiom 1 preservation requires.

  • learnRate :

    Retention rate: weight on prior score. Must be in (0,1).

  • reinforcement : A

    Reinforcement: the asymptotic target score for each action.

  • learnRate_pos : 0 < self.learnRate

    Retention rate is strictly positive.

  • learnRate_lt_one : self.learnRate < 1

    Retention rate is strictly less than 1.

  • reinforcement_nonneg (a : A) : 0 self.reinforcement a

    Reinforcement values are non-negative.

Instances For

    The complement 1 - α, which serves as the weight on reinforcement.

    Equations
    Instances For

      Linear update and choice-axiom preservation #

      def Core.LinearLearner.update {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) :

      One step of linear learning ([Luc59], the Alpha Model): v_{n+1}(s, a) = α · v_n(s, a) + (1 - α) · r(a).

      The result is a new RationalAction, so the Luce choice rule (IIA) holds at trial n+1 by construction.

      Equations
      Instances For
        theorem Core.linear_update_nonneg {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (s : S) (a : A) :
        0 (ll.update ra).score s a

        Updated scores are non-negative — a direct consequence of the positivity of α, (1-α), the original scores, and the reinforcement values. This is what makes the linear learning model a positive linear operator ([Luc59]).

        theorem Core.linear_learning_preserves_axiom1 {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (s : S) (a : A) :
        (ll.update ra).policy s a = if (ll.update ra).totalScore s = 0 then 0 else (ll.update ra).score s a / (ll.update ra).totalScore s

        Axiom 1 preservation ([Luc59]):

        If the Luce choice rule P(a) = v(a)/Σv(b) holds at trial n, then after a positive linear update v' = α·v + (1-α)·r, the updated rule P'(a) = v'(a)/Σv'(b) is again a Luce choice rule. This holds by construction: LinearLearner.update produces a RationalAction, whose policy is the Luce rule.

        The beta model #

        structure Core.BetaModel (A : Type u_3) :

        The β-model ([Luc59]; [BM55]):

        An alternative learning model operating directly on probabilities rather than ratio-scale values: P_{n+1}(a) = (1-β) · P_n(a) + β · δ(a, chosen_n)

        where δ(a, chosen_n) = 1 if a was chosen on trial n, 0 otherwise. Unlike the linear model, this operates in probability space, so it does not generally preserve the Luce ratio-scale structure.

        • beta :

          Learning rate in (0,1).

        • beta_pos : 0 < self.beta

          β is strictly positive.

        • beta_lt_one : self.beta < 1

          β is strictly less than 1.

        Instances For
          def Core.BetaModel.update {A : Type u_2} [DecidableEq A] (bm : BetaModel A) (P : A) (chosen : A) :
          A

          One step of β-model update ([Luc59]): P'(a) = (1-β) · P(a) + β · δ(a, chosen).

          Takes a probability function and the chosen action, returns updated probabilities.

          Equations
          • bm.update P chosen a = (1 - bm.beta) * P a + bm.beta * if a = chosen then 1 else 0
          Instances For
            theorem Core.BetaModel.update_nonneg {A : Type u_2} [DecidableEq A] (bm : BetaModel A) (P : A) (chosen : A) (hP : ∀ (a : A), 0 P a) (a : A) :
            0 bm.update P chosen a

            β-model update preserves non-negativity of probabilities.

            theorem Core.BetaModel.update_sum_one {A : Type u_2} [Fintype A] [DecidableEq A] (bm : BetaModel A) (P : A) (chosen : A) (h_chosen : chosen Finset.univ) (h_sum : a : A, P a = 1) :
            a : A, bm.update P chosen a = 1

            β-model update preserves sum-to-one when the input sums to 1.

            Iterated linear learning #

            def Core.LinearLearner.iterate {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) :
            RationalAction S A

            n-step iteration of linear learning ([Luc59]): apply the linear update rule n times starting from an initial RationalAction.

            Equations
            Instances For
              theorem Core.LinearLearner.iterate_nonneg {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (n : ) (s : S) (a : A) :
              0 (ll.iterate ra n).score s a

              After n iterations, scores are still non-negative.

              theorem Core.LinearLearner.iterate_closed_form {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (n : ) (s : S) (a : A) :
              (ll.iterate ra n).score s a = ll.learnRate ^ n * ra.score s a + (1 - ll.learnRate ^ n) * ll.reinforcement a

              Closed form for iterated linear learning ([Luc59]): after n iterations with retention rate α and reinforcement r, vₙ(s, a) = αⁿ · v₀(s, a) + (1 - αⁿ) · r(a), the geometric-series solution to the linear recurrence.

              Asymptotic convergence #

              theorem Core.linear_convergence {S : Type u_1} {A : Type u_2} [Fintype A] (ll : LinearLearner A) (ra : RationalAction S A) (s : S) (a : A) :
              Filter.Tendsto (fun (n : ) => (ll.iterate ra n).score s a) Filter.atTop (nhds (ll.reinforcement a))

              Convergence of linear learning ([Luc59]): under constant reinforcement, the ratio-scale values converge to the reinforcement values, vₙ(s, a) → r(a) as n → ∞.

              Expected choice sequences #

              structure Core.TrialRecord (A : Type u_3) :
              Type u_3

              A record of a learning trial: what was available, what was chosen, what was the reinforcement. Used for analyzing sequences of choices over trials ([Luc59]).

              • chosen : A

                The action chosen on this trial.

              • reinforcement : A

                The reinforcement received (may differ from the learner's default).

              Instances For
                structure Core.ExpectedChoiceSequence (S : Type u_3) (A : Type u_4) [Fintype A] :
                Type (max u_3 u_4)

                An expected choice sequence: the initial agent, a learner, and a history of trials. This structure supports analyzing how choice probabilities evolve over a sequence of reinforced trials ([Luc59]).

                • learner : LinearLearner A

                  The learning model governing updates.

                • initial : RationalAction S A

                  The initial rational agent (trial 0).

                • history : List (TrialRecord A)

                  History of trials (choices and reinforcements).

                Instances For
                  def Core.ExpectedChoiceSequence.agentAt {S : Type u_1} {A : Type u_2} [Fintype A] (ecs : ExpectedChoiceSequence S A) (n : ) :

                  The agent at trial n, after applying n steps of learning.

                  Equations
                  Instances For
                    noncomputable def Core.ExpectedChoiceSequence.choiceProb {S : Type u_1} {A : Type u_2} [Fintype A] (ecs : ExpectedChoiceSequence S A) (n : ) (s : S) (a : A) :

                    The probability of choosing action a in state s at trial n.

                    Equations
                    Instances For
                      theorem Core.ExpectedChoiceSequence.choiceProb_sum_one {S : Type u_1} {A : Type u_2} [Fintype A] (ecs : ExpectedChoiceSequence S A) (n : ) (s : S) (h : (ecs.agentAt n).totalScore s 0) :
                      a : A, ecs.choiceProb n s a = 1

                      Choice probabilities at any trial sum to 1 (when totalScore is nonzero), because each trial's agent is a RationalAction.

                      Stochastic gradient ascent (log-linear models) #

                      def Core.sgaUpdate (r_j η observed_j expected_j : ) :

                      Stochastic Gradient Ascent update for a single weight of a log-linear model.

                      For MaxEnt phonology, observed_j is the violation count of constraint j on the observed output, and expected_j is the expected violation count under the current model distribution.

                      The batch gradient is E_emp[c_j] − E_r̄[c_j] (see hasDerivAt_logConditional in RationalAction). The stochastic version replaces both expectations with single-sample estimates.

                      Equations
                      • Core.sgaUpdate r_j η observed_j expected_j = r_j + η * (observed_j - expected_j)
                      Instances For
                        def Core.glaUpdate (r_j η : ) (c_j_observed c_j_hypothesis : ) :

                        The Gradual Learning Algorithm ([Boe98]) update for a single weight: adjust by the signed difference between the observed violation count and the violation count of a hypothesis sampled from the current grammar.

                        Equations
                        • Core.glaUpdate r_j η c_j_observed c_j_hypothesis = r_j + η * (c_j_observed - c_j_hypothesis)
                        Instances For
                          theorem Core.gla_eq_sga (r_j η : ) (obs hyp : ) :
                          glaUpdate r_j η obs hyp = sgaUpdate r_j η obs hyp

                          GLA = SGA: the Gradual Learning Algorithm is Stochastic Gradient Ascent with single-sample estimates of both observed and expected feature values. The update rules are identical by definition.

                          theorem Core.sga_uses_correct_gradient {ι : Type u_1} [Fintype ι] [Nonempty ι] (s r : ι) (y : ι) (wⱼ : ) :
                          HasDerivAt (fun (w : ) => w * s y + r y - logSumExp (w s + r)) (s y - i : ι, softmax (wⱼ s + r) i * s i) wⱼ

                          SGA converges to the global maximum of a concave objective.

                          For MaxEnt log-likelihood, the per-weight objective is concave (logConditional_concaveOn in RationalAction), so SGA is guaranteed to converge. This is the key advantage of MaxEnt over Stochastic OT, where convergence of the GLA is not generally proved.

                          Rescorla-Wagner associative learning #

                          structure Core.RescorlaWagner (C : Type u_1) :
                          Type u_1

                          The Rescorla-Wagner learning model ([RW72]): on each trial every present cue has its associative strength updated by

                          ΔV_c = α_c · β · (λ − ΣV)
                          

                          where α_c is cue salience, β the learning rate (outcome importance), λ the maximum conditioning supported by the outcome, and ΣV the summed strength of the cues present on the trial. The prediction-error term (λ − ΣV) is what makes learning competitive across cues.

                          • salience : C

                            Cue salience: how noticeable each cue is. Must be in [0, 1].

                          • learnRate :

                            Learning rate (outcome importance). Must be in (0, 1].

                          • maxCond :

                            Maximum conditioning supported by the outcome.

                          • salience_nonneg (c : C) : 0 self.salience c

                            Salience is non-negative.

                          • salience_le_one (c : C) : self.salience c 1

                            Salience is at most 1.

                          • learnRate_pos : 0 < self.learnRate

                            Learning rate is strictly positive.

                          • learnRate_le_one : self.learnRate 1

                            Learning rate is at most 1.

                          • maxCond_nonneg : 0 self.maxCond

                            Maximum conditioning is non-negative.

                          Instances For
                            def Core.RescorlaWagner.predictionError {C : Type u_1} (rw : RescorlaWagner C) (present : Finset C) (V : C) :

                            Prediction error on a trial: λ − ΣV for cues present.

                            When positive, the outcome is under-predicted (surprise → learning). When zero, the outcome is fully predicted (no learning). When negative, the outcome is over-predicted (extinction).

                            Equations
                            Instances For
                              def Core.RescorlaWagner.update {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (present : Finset C) (V : C) (c : C) :

                              One trial of Rescorla-Wagner learning ([RW72]).

                              For each cue c:

                              • If c is present: V'(c) = V(c) + α_c · β · (λ − ΣV)
                              • If c is absent: V'(c) = V(c) (no change)
                              Equations
                              Instances For
                                theorem Core.RescorlaWagner.update_absent {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (present : Finset C) (V : C) (c : C) (hc : cpresent) :
                                rw.update present V c = V c

                                Absent cues are not updated.

                                theorem Core.RescorlaWagner.blocking {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (present : Finset C) (V : C) (B : C) (hB : B present) (h_total : cpresent, V c = rw.maxCond) :
                                rw.update present V B = V B

                                Blocking theorem ([RW72]; [Ell06]): when cue A already fully predicts the outcome (V(A) = λ) and is the only cue with nonzero strength among those present, adding a novel cue B to the compound produces zero learning for B: V'(B) = V(B).

                                theorem Core.RescorlaWagner.no_learning_at_equilibrium {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (present : Finset C) (V : C) (h_total : cpresent, V c = rw.maxCond) (c : C) :
                                rw.update present V c = V c

                                When the outcome is fully predicted, no present cue learns anything.

                                def Core.RescorlaWagner.iterate {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (trials : List (Finset C)) (V₀ : C) :
                                C

                                Iterated R-W learning over a sequence of trials. Each trial specifies which cues are present.

                                Equations
                                • rw.iterate trials V₀ = List.foldl (fun (V : C) (present : Finset C) (c : C) => rw.update present V c) V₀ trials
                                Instances For
                                  def Core.RescorlaWagner.iterateConst {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (V₀ : C) :
                                  C

                                  Constant-trial iteration: the same cue set is presented on every trial.

                                  Equations
                                  Instances For

                                    Rescorla-Wagner equilibrium, convergence, and overshadowing #

                                    theorem Core.RescorlaWagner.totalStrength_recurrence {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (V : C) :
                                    cS, rw.update S V c = cS, V c + rw.learnRate * rw.predictionError S V * cS, rw.salience c

                                    Total strength recurrence: the summed associative strength across present cues follows an affine recurrence after each trial,

                                    ΣV' = ΣV + β · (Σ_{c∈S} α_c) · (λ − ΣV)
                                        = (1 − β·A) · ΣV + β·A·λ       where A = Σ_{c∈S} α_c
                                    

                                    the same shape as the Luce α-model (cf. LinearLearner.iterate_closed_form) with retention rate 1 − β·A, since the prediction error (λ − ΣV) is shared across all present cues.

                                    theorem Core.RescorlaWagner.totalStrength_convergence {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (V₀ : C) (h_pos : 0 < rw.learnRate * cS, rw.salience c) (h_stable : rw.learnRate * cS, rw.salience c < 1) :
                                    Filter.Tendsto (fun (n : ) => cS, rw.iterateConst S V₀ n c) Filter.atTop (nhds rw.maxCond)

                                    Total strength convergence: under repeated identical trials with the same cue set S, the total associative strength converges to λ.

                                    theorem Core.RescorlaWagner.proportional_partition {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (n : ) :
                                    ∃ (K : ), cS, rw.iterateConst S (fun (x : C) => 0) n c = rw.salience c * K

                                    Proportional partition ([RW72]): when all cues start with zero associative strength and the same cue set is presented on every trial, each cue's strength at every step is proportional to its salience — there exists Kₙ with Vₙ(c) = α_c · Kₙ for every present cue c. At convergence each cue's strength is its salience-share of the outcome maximum λ.

                                    theorem Core.RescorlaWagner.overshadowing {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (A B : C) (n : ) (hA : A S) (hB : B S) (_hne : A B) (h_salience : rw.salience B < rw.salience A) (h_pos : 0 < cS, rw.iterateConst S (fun (x : C) => 0) n c) :
                                    rw.iterateConst S (fun (x : C) => 0) n B < rw.iterateConst S (fun (x : C) => 0) n A

                                    Overshadowing ([RW72]; [Ell06]): when two cues are both present on every trial, the more salient cue captures more associative strength at every step (and hence at equilibrium). Immediate from proportional_partition: Vₙ(c) = α_c · K, so Vₙ(A) = α_A · K > α_B · K = Vₙ(B) whenever K > 0.

                                    theorem Core.RescorlaWagner.blocked_cue_zero {C : Type u_1} [DecidableEq C] (rw : RescorlaWagner C) (S : Finset C) (V : C) (C' : C) (hC : C' S) (hV : V C' = 0) (h_total : cS, V c = rw.maxCond) :
                                    rw.update S V C' = 0

                                    ΔP–R-W correspondence ([Ell06]; [CH95a]): the blocked direction of a blocking experiment. If one cue already captures all the associative strength and the novel cue starts at zero, the novel cue remains at zero after compound trials — matching ΔP = 0 for the blocked cue.