
Linglib.Theories.Processing.Lexical.Discriminative.Training

DLM training: Endstate Learning vs Frequency-Informed Learning #

@cite{baayen-2019} @cite{gahl-baayen-2024} @cite{heitmeier-chuang-baayen-2026}

Sibling to Defs.lean, Normed.lean, and Measures.lean. Hosts the substrate for what counts as a learned production map — the optimization characterization of Endstate Learning (EL) and Frequency-Informed Learning (FIL).

Paper-faithful representation: FrequencyVector (= paper's Q) #

FrequencyVector m is the type-faithful representation of the diagonal frequency matrix Q of @cite{gahl-baayen-2024} (appendix §A1.3): one nonneg weight per usage event. We do not normalise to a probability distribution; the paper works with raw counts. The PMF view is a derived bridge — call PMF.ofRealWeightFn from Core.Probability.Constructions directly when cross-tradition theorems require it. The ERM-equivalence between raw q and normalised q.normalize is proved in the sibling file RescalingInvariance.lean.

The cognitive interpretation: q i is the number of times the learner has experienced event i. EL corresponds to type-uniform weights (q ≡ 1); FIL corresponds to token-frequency weights (q i = #occurrences).
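
A minimal sketch of the two canonical choices, in plain Mathlib terms rather than the library's own definitions; count is a hypothetical token-count function supplied by the modeller, not part of the library.

import Mathlib

-- Endstate learning: every usage event carries weight 1 (type-uniform).
def elWeights (m : ℕ) : Fin m → ℝ := fun _ => 1

-- Frequency-informed learning: each event weighted by its raw token count.
def filWeights {m : ℕ} (count : Fin m → ℕ) : Fin m → ℝ := fun i => (count i : ℝ)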

What this file establishes #

The substantive cross-framework structure: different cognitive theories of learning correspond to different choices of q, and the substrate captures the architecture in which those theories diverge.

What this file does NOT do #

This is not generic regression formalization. The substrate captures:

  1. The loss function weightedLoss as the cognitive commitment (paper §3 of @cite{gahl-baayen-2024}: minimising squared error per usage event is what the learner does).
  2. The frequency-weight parameterisation as the cross-theory axis (paper §3.1 distinguishes EL from FIL only via q).

We do not formalise the generic regression machinery beyond what these two commitments require.

structure Theories.Processing.Lexical.Discriminative.TrainingExperience (numEvents formDim meaningDim : ℕ) : Type

A training experience is a finite indexed collection of (meaning, form) observation pairs. The paper's S matrix has rows data.meanings i and C matrix has rows data.forms i, indexed by event i : Fin m.

The cognitive interpretation: each i : Fin m is a "usage event" — a single attestation of a (meaning, form) pair. Type-based learning treats each unique pair as one event; frequency-informed learning may have multiple events per pair (replication = weight, deferred).

  • meanings : Fin numEvents → MeaningVec meaningDim
  • forms : Fin numEvents → FormVec formDim
Instances For
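
For concreteness, a standalone mirror of the documented fields, under the assumption that MeaningVec d and FormVec n are plain Fin _ → ℝ vectors (primed name to signal this is a sketch, not the library's structure; continues the import above), together with a toy two-event experience:

structure TrainingExperience' (numEvents formDim meaningDim : ℕ) where
  meanings : Fin numEvents → (Fin meaningDim → ℝ)
  forms    : Fin numEvents → (Fin formDim → ℝ)

-- Two usage events in a 2-dimensional meaning space and 2-dimensional form space.
def toyData : TrainingExperience' 2 2 2 where
  meanings := ![![1, 0], ![0, 1]]
  forms    := ![![0.5, 0.5], ![0, 1]]
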
    @[reducible, inline]

    A frequency vector — paper notation Q (the diagonal entries of the weight matrix in @cite{gahl-baayen-2024} appendix §A1.3). One nonneg weight per usage event.

    We use Fin m → ℝ (rather than Fin m → NNReal) for proof convenience; nonnegativity is documented and asserted as a hypothesis in theorems that need it. The cognitive theory choice IS the choice of q:

    • q ≡ 1 (uniform): type-based learning → EL
    • q i = #occurrences i: frequency-informed → FIL
    • q i = log(#occurrences i): log-frequency learning
    • q i = exp(-decay · timeAgo i): recency-weighted

    All instantiations of FrequencyVector correspond to different cognitive commitments about which usage events count for learning; the last two choices are sketched after this entry.

    Equations
    Instances For
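
    The last two bullet points can be sketched the same way; count and timeAgo are assumed inputs of the sketch, not library definitions.

    -- Log-frequency learning: weights grow sublinearly with token count.
    def logFreqWeights {m : ℕ} (count : Fin m → ℕ) : Fin m → ℝ :=
      fun i => Real.log (count i)

    -- Recency-weighted learning: older events are exponentially discounted.
    def recencyWeights {m : ℕ} (decay : ℝ) (timeAgo : Fin m → ℝ) : Fin m → ℝ :=
      fun i => Real.exp (-decay * timeAgo i)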

      The constant-1 frequency vector — type-uniform weighting; every event counts once regardless of frequency. Endstate learning operates with this q.

      Equations
      Instances For

        Total mass of a frequency vector — the sum of all event weights. Cognitive interpretation: the total accumulated experience the learner has been exposed to.

        Equations
        Instances For

          Normalise a frequency vector so its weights sum to 1. The result represents the empirical distribution over events. Note this is a FrequencyVector (still typed as Fin m → ℝ); for the actual PMF cast use PMF.ofRealWeightFn from Core.Probability.Constructions.

          Equations
          Instances For
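
          Both helpers are one-liners over the mirror sketch above (primed names; how the library handles a zero-mass q is not shown here).

          -- Total accumulated experience: the sum of all event weights.
          def totalMass' {m : ℕ} (q : Fin m → ℝ) : ℝ := ∑ i, q i

          -- Empirical-distribution view: rescale so the weights sum to 1.
          -- Real division by zero returns 0, so a zero-mass q maps to the zero vector.
          def normalize' {m : ℕ} (q : Fin m → ℝ) : Fin m → ℝ :=
            fun i => q i / totalMass' q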

            Squared coordinate-distance between two form vectors: Σⱼ (a j - b j)². Direct formula, no normed-space machinery required. Distinct from the bare-Pi sup-norm in Normed.lean; here we want the L² (Frobenius) inner product structure on Fin n → ℝ that the paper's quadratic loss uses.

            Equations
            Instances For
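
            A direct transcription of the formula, again over plain Fin n → ℝ vectors:

            -- Squared coordinate-distance: Σⱼ (a j - b j)²; no norm instance required.
            def sqDist {n : ℕ} (a b : Fin n → ℝ) : ℝ := ∑ j, (a j - b j) ^ 2
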
              def Theories.Processing.Lexical.Discriminative.weightedLoss {m n d : ℕ} (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[ℝ] FormVec n) : ℝ

              The frequency-weighted training loss for a candidate production map G:

              weightedLoss data q G = Σᵢ qᵢ · ‖G(meaningsᵢ) − formsᵢ‖²

              where ‖·‖² is squared coordinate-distance (= squared Frobenius norm on the form-vector slot).

              This is the paper's loss in its appendix §A1.3 form, before being recast as the matrix expression ‖√Q (SG − C)‖²_F. The cognitive commitment is at the level of this loss function: the learner minimises the per-event squared mismatch between the produced and observed form vectors, weighted by frequency of occurrence.

              Equations
              Instances For
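
              Over the mirror definitions above, the loss transcribes directly; G is any real-linear map from meaning vectors to form vectors.

              -- Σᵢ qᵢ · ‖G(meaningsᵢ) − formsᵢ‖², with ‖·‖² = sqDist.
              def weightedLoss' {m n d : ℕ} (data : TrainingExperience' m n d)
                  (q : Fin m → ℝ) (G : (Fin d → ℝ) →ₗ[ℝ] (Fin n → ℝ)) : ℝ :=
                ∑ i, q i * sqDist (G (data.meanings i)) (data.forms i)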

                A linear map G is an empirical risk minimiser (ERM) for the experience data under frequency vector q if no other linear map achieves a smaller weighted loss.

                The cognitive theory choice IS the choice of q. Different theories of learning specify different q's; the optimisation procedure (minimise the resulting weighted L² loss) is fixed across them.

                Equations
                • One or more equations did not get rendered due to their size.
                Instances For
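
                In the mirror sketch, the predicate is a universally quantified loss comparison.

                -- G is an ERM solution iff no linear map attains a smaller weighted loss.
                def IsERM' {m n d : ℕ} (data : TrainingExperience' m n d)
                    (q : Fin m → ℝ) (G : (Fin d → ℝ) →ₗ[ℝ] (Fin n → ℝ)) : Prop :=
                  ∀ G' : (Fin d → ℝ) →ₗ[ℝ] (Fin n → ℝ),
                    weightedLoss' data q G ≤ weightedLoss' data q G'
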
                  @[reducible, inline]
                  abbrev Theories.Processing.Lexical.Discriminative.IsELSolution {m n d : ℕ} (data : TrainingExperience m n d) (G : MeaningVec d →ₗ[ℝ] FormVec n) : Prop

                  An endstate-learning solution for the experience: an ERM solution under type-uniform weights. Paper appendix §A1.1 of @cite{gahl-baayen-2024}.

                  Equations
                  Instances For
                    @[reducible, inline]

                    A frequency-informed learning solution — abbreviation for ERM under arbitrary q. The cognitive interpretation that q is "token frequencies of usage events" lives in the choice of q the consumer passes; the substrate is agnostic about which q is empirically correct.

                    Paper appendix §A1.3 of @cite{gahl-baayen-2024} introduces FIL with q = corpus token frequencies; future cognitive theories may motivate other q choices.

                    Equations
                    Instances For
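
                    In the mirror sketch, the two named special cases are thin abbreviations over the ERM predicate.

                    -- Endstate learning: ERM under type-uniform weights.
                    abbrev IsEL' {m n d : ℕ} (data : TrainingExperience' m n d)
                        (G : (Fin d → ℝ) →ₗ[ℝ] (Fin n → ℝ)) : Prop :=
                      IsERM' data (fun _ => 1) G

                    -- Frequency-informed learning: ERM under whatever q the theory supplies.
                    abbrev IsFIL' {m n d : ℕ} (data : TrainingExperience' m n d)
                        (q : Fin m → ℝ) (G : (Fin d → ℝ) →ₗ[ℝ] (Fin n → ℝ)) : Prop :=
                      IsERM' data q G
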
                      theorem Theories.Processing.Lexical.Discriminative.weightedLoss_smul_frequency {m n d : ℕ} (data : TrainingExperience m n d) (q : FrequencyVector m) (c : ℝ) (G : MeaningVec d →ₗ[ℝ] FormVec n) :
                      weightedLoss data (c • q) G = c * weightedLoss data q G

                      T-Loss-linearity: the weighted loss is linear in the frequency vector. Direct algebraic fact; basis for T-Rescaling.
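
                      The same identity goes through for the mirror definitions; one plausible proof pulls the scalar inside the sum and reassociates.

                      example {m n d : ℕ} (data : TrainingExperience' m n d) (q : Fin m → ℝ)
                          (c : ℝ) (G : (Fin d → ℝ) →ₗ[ℝ] (Fin n → ℝ)) :
                          weightedLoss' data (c • q) G = c * weightedLoss' data q G := by
                        simp only [weightedLoss', Pi.smul_apply, smul_eq_mul,
                          Finset.mul_sum, mul_assoc]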

                      theorem Theories.Processing.Lexical.Discriminative.weightedLoss_add_frequency {m n d : ℕ} (data : TrainingExperience m n d) (q₁ q₂ : FrequencyVector m) (G : MeaningVec d →ₗ[ℝ] FormVec n) :
                      weightedLoss data (q₁ + q₂) G = weightedLoss data q₁ G + weightedLoss data q₂ G

                      T-Loss-add-distrib: the weighted loss decomposes additively over the frequency vector. Used downstream to formalise mixture decompositions of cognitive theories.
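
                      Likewise for additivity in the mirror sketch: distribute the product and split the sum.

                      example {m n d : ℕ} (data : TrainingExperience' m n d) (q₁ q₂ : Fin m → ℝ)
                          (G : (Fin d → ℝ) →ₗ[ℝ] (Fin n → ℝ)) :
                          weightedLoss' data (q₁ + q₂) G
                            = weightedLoss' data q₁ G + weightedLoss' data q₂ G := by
                        simp only [weightedLoss', Pi.add_apply, add_mul,
                          Finset.sum_add_distrib]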

                      theorem Theories.Processing.Lexical.Discriminative.ermSolution_iff_rescaled {m n d : ℕ} (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[ℝ] FormVec n) {c : ℝ} (hc : 0 < c) :
                      IsERMSolution data (c • q) G ↔ IsERMSolution data q G

                      T-Rescaling: ERM solutions are invariant under positive rescaling of the frequency vector. Relative frequencies matter; absolute scale doesn't.

                      Cognitive content: a learner who has experienced word A 100 times and word B 50 times has the same trained lexicon as one who experienced A 200 times and B 100 times. The total amount of accumulated experience (= the overall scale of q) doesn't affect the trained end-state — only the relative ratios do.

                      Proved via weightedLoss_smul_frequency: scaling q by c > 0 scales the loss by c; argmins are invariant under positive scalar multiples.
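
                      Spelled out: writing L[q](G) for weightedLoss data q G, T-Loss-linearity gives L[c • q](G) = c · L[q](G) for every G, so for c > 0, L[c • q](G) ≤ L[c • q](G′) ⟺ c · L[q](G) ≤ c · L[q](G′) ⟺ L[q](G) ≤ L[q](G′); quantifying over G′ gives the stated equivalence of ERM solutions.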

                      theorem Theories.Processing.Lexical.Discriminative.weightedLoss_zero_event_drops {m n d : ℕ} (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[ℝ] FormVec n) (i₀ : Fin m) (h : q i₀ = 0) :
                      weightedLoss data q G = weightedLoss data (Function.update q i₀ 0) G

                      T-Support (weak form): if q i = 0, event i contributes nothing to the weighted loss.

                      Cognitive content: events that the learner has not experienced (q i = 0) cannot update the trained lexicon. Novel words and pseudowords don't retroactively modify the production map; they can only be processed by the existing D.production (which is polymorphic over the entire meaning space, hence applies to any meaning vector — but the trained map's coefficients reflect only the support of q).
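
                      In the mirror sketch, the vanishing of the per-event term is immediate.

                      -- A zero-weight event contributes a zero summand to the loss.
                      example {m n d : ℕ} (data : TrainingExperience' m n d) (q : Fin m → ℝ)
                          (G : (Fin d → ℝ) →ₗ[ℝ] (Fin n → ℝ)) (i₀ : Fin m) (h : q i₀ = 0) :
                          q i₀ * sqDist (G (data.meanings i₀)) (data.forms i₀) = 0 := by
                        simp [h]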

                      Definitional equivalence: an IsELSolution is exactly an ERM solution under uniform weights. Paper-canonical: paper §3.1 of @cite{gahl-baayen-2024} introduces EL as the special case of FIL with Q = I. Here uniformFrequency = fun _ => 1 is the discrete analogue.
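
                      Over the mirror abbreviations, the corresponding equivalence is definitional.

                      -- IsEL' literally unfolds to ERM under the constant-1 weighting.
                      example {m n d : ℕ} (data : TrainingExperience' m n d)
                          (G : (Fin d → ℝ) →ₗ[ℝ] (Fin n → ℝ)) :
                          IsEL' data G ↔ IsERM' data (fun _ => 1) G :=
                        Iff.rfl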

                      A LinearDiscriminativeLexicon's production map is the trained production map for given training data and frequency weights iff the production map is an ERM solution.

                      The substrate is agnostic about which q is empirically correct; this predicate just records the relationship between a DLM's production map and a particular cognitive theory's training data.

                      Equations
                      Instances For
                        @[reducible, inline]

                        A DLM is EL-trained for the given data iff its production map is an ERM solution under type-uniform weights.

                        Equations
                        Instances For
                          @[reducible, inline]

                          A DLM is FIL-trained with a given frequency vector iff its production map is an ERM solution under that frequency vector.

                          Equations
                          Instances For