Documentation

Linglib.Processing.Lexical.Discriminative.Training

DLM training: endstate vs frequency-informed learning #

[BCSBB19] [GB24] [HCB26]

What counts as a learned production map: the optimization characterization of EndState Learning (EL) and Frequency-Informed Learning (FIL). The cognitive theory choice IS the choice of frequency weights q; the optimization — minimise the weighted squared loss — is fixed across theories. FrequencyVector is [GB24]'s diagonal Q (appendix §A1.3); [HCB26] (ch. 6) introduces the same device with max-normalised entries applied as √P premultiplication of both S and CermSolution_iff_rescaled connects the two forms.

Main declarations #

Implementation notes #

This is not generic regression formalization: the loss function is the cognitive commitment ([GB24] §3) and the frequency-weight parameterisation is the cross-theory axis. squaredDist is the L² kernel the quadratic loss uses, distinct from Normed.lean's sup-norm carrier structure. The PMF view of q.normalize is a derived bridge (PMF.ofRealWeightFn in Core.Probability.Constructions).

TODO #

Substrate types #

structure Processing.Lexical.Discriminative.TrainingExperience (numEvents formDim meaningDim : ) :

A training experience is a finite indexed collection of (meaning, form) observation pairs. The paper's S matrix has rows data.meanings i and C matrix has rows data.forms i, indexed by event i : Fin m.

The cognitive interpretation: each i : Fin m is a "usage event" — a single attestation of a (meaning, form) pair. Type-based learning treats each unique pair as one event; frequency-informed learning may have multiple events per pair (replication = weight, deferred).

  • meanings : Fin numEventsMeaningVec meaningDim
  • forms : Fin numEventsFormVec formDim
Instances For
    @[reducible, inline]

    A frequency vector — paper notation Q (the diagonal entries of the weight matrix in [GB24] appendix §A1.3). One nonneg weight per usage event.

    We use Fin m → ℝ (rather than Fin m → NNReal) for proof convenience; nonnegativity is documented and asserted as a hypothesis in theorems that need it. The cognitive theory choice IS the choice of q:

    • q ≡ 1 (uniform): type-based learning → EL
    • q i = #occurrences i: frequency-informed → FIL
    • q i = log(#occurrences i): log-frequency learning
    • q i = exp(-decay · timeAgo i): recency-weighted

    All instantiations of FrequencyVector correspond to different cognitive commitments about which usage events count for learning.

    Equations
    Instances For

      The constant-1 frequency vector — type-uniform weighting; every event counts once regardless of frequency. Endstate learning operates with this q.

      Equations
      Instances For

        Total mass of a frequency vector — the sum of all event weights. Cognitive interpretation: the total accumulated experience the learner has been exposed to.

        Equations
        Instances For

          Normalise a frequency vector so its weights sum to 1. The result represents the empirical distribution over events. Note this is a FrequencyVector (still typed as Fin m → ℝ); for the actual PMF cast use PMF.ofRealWeightFn from Core.Probability.Constructions.

          Equations
          Instances For

            The weighted loss #

            Squared coordinate-distance between two form vectors: Σⱼ (a j - b j)². Direct formula, no normed-space machinery required. Distinct from the bare-Pi sup-norm in Normed.lean; here we want the L² (Frobenius) inner product structure on Fin n → ℝ that the paper's quadratic loss uses.

            Equations
            Instances For
              def Processing.Lexical.Discriminative.weightedLoss {m n d : } (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[] FormVec n) :

              The frequency-weighted training loss for a candidate production map G:

              weightedLoss data q G = Σᵢ qᵢ · ‖G(meaningsᵢ) − formsᵢ‖²

              where ‖·‖² is squared coordinate-distance (= squared Frobenius norm on the form-vector slot).

              This is the paper's loss in its appendix §A1.3 form, before being recast as the matrix expression ‖√Q (SG − C)‖²_F. The cognitive commitment is at the level of this loss function: the learner minimises the per-event squared mismatch between the produced and observed form vectors, weighted by frequency of occurrence.

              Equations
              Instances For
                theorem Processing.Lexical.Discriminative.weightedLoss_nonneg {m n d : } (data : TrainingExperience m n d) {q : FrequencyVector m} (hq : ∀ (i : Fin m), 0 q i) (G : MeaningVec d →ₗ[] FormVec n) :
                0 weightedLoss data q G

                Solution Props: ERM, EL, FIL #

                def Processing.Lexical.Discriminative.IsERMSolution {m n d : } (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[] FormVec n) :

                A linear map G is an empirical risk minimiser (ERM) for the experience data under frequency vector q if no other linear map achieves a smaller weighted loss.

                The cognitive theory choice IS the choice of q. Different theories of learning specify different q's; the optimisation procedure (minimise the resulting weighted L² loss) is fixed across them.

                Equations
                • One or more equations did not get rendered due to their size.
                Instances For
                  @[reducible, inline]
                  abbrev Processing.Lexical.Discriminative.IsELSolution {m n d : } (data : TrainingExperience m n d) (G : MeaningVec d →ₗ[] FormVec n) :

                  An endstate-learning solution for the experience: an ERM solution under type-uniform weights. Paper appendix §A1.1 of [GB24].

                  Equations
                  Instances For
                    @[reducible, inline]
                    abbrev Processing.Lexical.Discriminative.IsFILSolution {m n d : } (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[] FormVec n) :

                    A frequency-informed learning solution — abbreviation for ERM under arbitrary q. The cognitive interpretation that q is "token frequencies of usage events" lives in the choice of q the consumer passes; the substrate is agnostic about which q is empirically correct.

                    Paper appendix §A1.3 of [GB24] introduces FIL with q = corpus token frequencies; future cognitive theories may motivate other q choices.

                    Equations
                    Instances For

                      Structural theorems #

                      theorem Processing.Lexical.Discriminative.weightedLoss_smul_frequency {m n d : } (data : TrainingExperience m n d) (q : FrequencyVector m) (c : ) (G : MeaningVec d →ₗ[] FormVec n) :
                      weightedLoss data (c q) G = c * weightedLoss data q G

                      T-Loss-linearity: the weighted loss is linear in the frequency vector. Direct algebraic fact; basis for T-Rescaling.

                      theorem Processing.Lexical.Discriminative.weightedLoss_add_frequency {m n d : } (data : TrainingExperience m n d) (q₁ q₂ : FrequencyVector m) (G : MeaningVec d →ₗ[] FormVec n) :
                      weightedLoss data (q₁ + q₂) G = weightedLoss data q₁ G + weightedLoss data q₂ G

                      T-Loss-add-distrib: the weighted loss decomposes additively over the frequency vector. Used downstream to formalise mixture decompositions of cognitive theories.

                      theorem Processing.Lexical.Discriminative.ermSolution_iff_rescaled {m n d : } (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[] FormVec n) {c : } (hc : 0 < c) :
                      IsERMSolution data (c q) G IsERMSolution data q G

                      T-Rescaling: ERM solutions are invariant under positive rescaling of the frequency vector. Relative frequencies matter; absolute scale doesn't.

                      Cognitive content: a learner who has experienced word A 100 times and word B 50 times has the same trained lexicon as one who experienced A 200 times and B 100 times. Time of accumulated experience (= overall scale of q) doesn't affect the trained end-state — only the relative ratios.

                      Proved via weightedLoss_smul_frequency: scaling q by c > 0 scales the loss by c; argmins are invariant under positive scalar multiples.

                      theorem Processing.Lexical.Discriminative.weightedLoss_zero_event_drops {m n d : } (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[] FormVec n) (i₀ : Fin m) (h : q i₀ = 0) :
                      weightedLoss data q G = weightedLoss data (Function.update q i₀ 0) G

                      T-Support (weak form): if q i = 0, event i contributes nothing to the weighted loss.

                      Cognitive content: events that the learner has not experienced (q i = 0) cannot update the trained lexicon. Novel words and pseudowords don't retroactively modify the production map; they can only be processed by the existing D.production (which is polymorphic over the entire meaning space, hence applies to any meaning vector — but the trained map's coefficients reflect only the support of q).

                      Definitional equivalence: an IsELSolution is exactly an ERM solution under uniform weights. Paper-canonical: paper §3.1 of [GB24] introduces EL as the special case of FIL with Q = I. Here uniformFrequency = fun _ => 1 is the discrete analogue.

                      Per-coordinate residual optimality #

                      The loss separates across form coordinates: each column of G regresses one cue's support on the semantic dimensions, "optimal in the least-squares sense" ([HCB26] ch. 6). So ERM is exactly columnwise unbeatability, and a coordinate that some functional decodes exactly is reproduced exactly.

                      def Processing.Lexical.Discriminative.coordResidual {m n d : } (data : TrainingExperience m n d) (q : FrequencyVector m) (pred : Fin m) (j₀ : Fin n) :

                      Weighted squared residual of a predicted column pred against form coordinate j₀ of the training data.

                      Equations
                      Instances For
                        theorem Processing.Lexical.Discriminative.coordResidual_nonneg {m n d : } (data : TrainingExperience m n d) {q : FrequencyVector m} (hq : ∀ (i : Fin m), 0 q i) (pred : Fin m) (j₀ : Fin n) :
                        0 coordResidual data q pred j₀
                        theorem Processing.Lexical.Discriminative.weightedLoss_eq_sum_coordResidual {m n d : } (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[] FormVec n) :
                        weightedLoss data q G = j : Fin n, coordResidual data q (fun (k : Fin m) => G (data.meanings k) j) j

                        T-Separability: the weighted loss is the sum over form coordinates of the per-coordinate residuals — each column of the production map is an independent regression.

                        theorem Processing.Lexical.Discriminative.isERMSolution_iff_coordResidual {m n d : } (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[] FormVec n) :
                        IsERMSolution data q G ∀ (j₀ : Fin n) (w : MeaningVec d →ₗ[] ), coordResidual data q (fun (k : Fin m) => G (data.meanings k) j₀) j₀ coordResidual data q (fun (k : Fin m) => w (data.meanings k)) j₀

                        T-Coordinate-optimality: G is an ERM solution iff at every form coordinate its column's residual is at most that of any linear functional of the meanings — ERM is columnwise-unbeatable regression. No sign condition on q in either direction.

                        theorem Processing.Lexical.Discriminative.IsERMSolution.coord_eq_of_decodable {m n d : } {data : TrainingExperience m n d} {q : FrequencyVector m} (hq : ∀ (i : Fin m), 0 < q i) {G : MeaningVec d →ₗ[] FormVec n} (hG : IsERMSolution data q G) {j₀ : Fin n} {w : MeaningVec d →ₗ[] } (hw : ∀ (i : Fin m), w (data.meanings i) = data.forms i j₀) (i : Fin m) :
                        G (data.meanings i) j₀ = data.forms i j₀

                        T-Decodable-exact-fit: if some linear functional w of the meanings exactly reproduces coordinate j₀ of the observed forms, then any ERM solution under positive weights also reproduces it exactly on the training events. Zero-residual case of isERMSolution_iff_coordResidual.

                        theorem Processing.Lexical.Discriminative.isERMSolution_of_interpolates {m n d : } {data : TrainingExperience m n d} {q : FrequencyVector m} (hq : ∀ (i : Fin m), 0 q i) {G : MeaningVec d →ₗ[] FormVec n} (hG : ∀ (i : Fin m), G (data.meanings i) = data.forms i) :
                        IsERMSolution data q G

                        T-Interpolation: a map that reproduces every training form exactly is an ERM solution under any nonnegative weights — IsERMSolution is nonvacuous whenever the data are linearly interpolable.

                        Connection to LinearDiscriminativeLexicon #

                        A LinearDiscriminativeLexicon's production map is the trained production map for given training data and frequency weights iff the production map is an ERM solution.

                        The substrate is agnostic about which q is empirically correct; this predicate just records the relationship between a DLM's production map and a particular cognitive theory's training data.

                        Equations
                        Instances For
                          @[reducible, inline]

                          A DLM is EL-trained for given data iff its production map is the type-uniform ERM solution.

                          Equations
                          Instances For
                            @[reducible, inline]

                            A DLM is FIL-trained with a given frequency vector iff its production map is the corresponding ERM solution.

                            Equations
                            Instances For
                              theorem Processing.Lexical.Discriminative.LinearDiscriminativeLexicon.IsTrainedOn.semSup_eq_of_decodable {m n d : } {D : LinearDiscriminativeLexicon (FormVec n) (MeaningVec d)} {data : TrainingExperience m n d} {q : FrequencyVector m} (hD : D.IsTrainedOn data q) (hq : ∀ (i : Fin m), 0 < q i) {j₀ : Fin n} {w : MeaningVec d →ₗ[] } (hw : ∀ (i : Fin m), w (data.meanings i) = data.forms i j₀) (i : Fin m) :
                              semSup D (data.meanings i) j₀ = data.forms i j₀

                              A trained DLM's semantic support at a linearly decodable form coordinate equals the observed form value on every positively-weighted training event; any contrast carried by that coordinate — categorical or graded — transfers to semSup exactly.

                              theorem Processing.Lexical.Discriminative.LinearDiscriminativeLexicon.IsTrainedOn.semSupWord_eq_of_decodable {m n d : } {D : LinearDiscriminativeLexicon (FormVec n) (MeaningVec d)} {data : TrainingExperience m n d} {q : FrequencyVector m} (hD : D.IsTrainedOn data q) (hq : ∀ (i : Fin m), 0 < q i) {js : List (Fin n)} (hw : jjs, ∃ (w : MeaningVec d →ₗ[] ), ∀ (i : Fin m), w (data.meanings i) = data.forms i j) (i : Fin m) :
                              semSupWord D (data.meanings i) js = (List.map (fun (j : Fin n) => data.forms i j) js).sum

                              [HCB26]'s Semantic Support for FormsemSupWord over a word's cue coordinates — equals the sum of the observed form values whenever each coordinate is linearly decodable from the meanings.