DLM training: Endstate Learning vs Frequency-Informed Learning #
@cite{baayen-2019} @cite{gahl-baayen-2024} @cite{heitmeier-chuang-baayen-2026}
Sibling to Defs.lean, Normed.lean, Measures.lean. Hosts the
substrate for what counts as a learned production map — the
optimization characterization of Endstate Learning (EL) and
Frequency-Informed Learning (FIL).
Paper-faithful representation: FrequencyVector (= paper's Q) #
FrequencyVector m is the type-faithful representation of paper
@cite{gahl-baayen-2024}'s diagonal frequency matrix Q (appendix §A1.3):
one nonneg weight per usage event. We do not normalise to a
probability distribution; the paper works with raw counts. The PMF
view is a derived bridge — call PMF.ofRealWeightFn from
Core.Probability.Constructions directly when cross-tradition
theorems require it. The ERM-equivalence between raw q and
normalised q.normalize is proved in the sibling file
RescalingInvariance.lean.
The cognitive interpretation: q i is the number of times the
learner has experienced event i. EL corresponds to type-uniform
weights (q ≡ 1); FIL corresponds to token-frequency weights
(q i = #occurrences).
What this file establishes #
The substantive cross-framework structure. Different cognitive
theories of learning correspond to different choices of q; the
substrate captures the architecture in which those theories diverge:
- weightedLoss data q G — the per-event squared-error loss weighted by q. Paper appendix §A1.3 of @cite{gahl-baayen-2024}.
- IsERMSolution data q G — G minimises the weighted loss. The cognitive theory choice IS the choice of q; the optimization procedure is fixed.
- IsELSolution data G and IsFILSolution data q G — abbrevs capturing the type-uniform and frequency-weighted cases.
- weightedLoss_smul_frequency — loss is linear in q.
- ermSolution_iff_rescaled — T-Rescaling: the ERM solution is invariant under positive rescaling of q. Relative frequencies matter; absolute scale doesn't. (Paper §3.1 appendix discussion of the equivalent FIL forms.)
- weightedLoss_zero_event_drops — T-Support (weak form): events with q i = 0 contribute nothing to the loss. Novel unattested events don't update the lexicon.
- isELSolution_eq_isERM_uniform — T-Uniform-EL-equivalence: EL is ERM under the constant-1 frequency vector. Definitional.
What this file does NOT do #
This is not generic regression formalization. The substrate captures:
- The loss function weightedLoss as the cognitive commitment (paper §3 of @cite{gahl-baayen-2024}: minimising squared error per usage event is what the learner does).
- The frequency-weight parameterisation as the cross-theory axis (paper §3.1 distinguishes EL from FIL only via q).
We do not formalise:
- The closed-form (SᵀQS)⁻¹SᵀQC — that's matrix algebra, not theory-specific. A future Training/ClosedForm.lean could derive it from the optimization characterization here as a theorem (= "the closed form is the unique minimum when SᵀQS is invertible") — but that's regression formalization in the service of showing equivalence to the optimization picture.
- The iterative Widrow-Hoff convergence to IsFILSolution (paper appendix §A5.1; @cite{heitmeier-chuang-baayen-2026} Heitmeier 2024 argument). Defer until a second consumer needs it.
- The PMF / ERM-theoretic reformulation. Mathematically equivalent in the finite case, but interpretively additive — the paper authors avoid framing this as ERM under empirical distributions (see §6.4: "we would caution against reifying any particular variable on the basis of its predictiveness"). A derived PMF view is straightforward via normalization.
A training experience is a finite indexed collection of
(meaning, form) observation pairs. The paper's S matrix has
rows data.meanings i and its C matrix has rows data.forms i,
indexed by event i : Fin m; a sketch of the structure follows below.
The cognitive interpretation: each i : Fin m is a "usage event"
— a single attestation of a (meaning, form) pair. Type-based
learning treats each unique pair as one event; frequency-informed
learning may have multiple events per pair (replication = weight,
deferred).
- meanings : Fin numEvents → MeaningVec meaningDim
- forms : Fin numEvents → FormVec formDim
Instances For
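For orientation, a hedged Lean sketch of the shape of this structure. The structure name TrainingExperience and the definitions of MeaningVec / FormVec below are assumptions for illustration; only the two fields are taken from the declaration above.

```lean
import Mathlib

-- Hypothetical sketch: `TrainingExperience`, `MeaningVec` and `FormVec` are
-- assumed names; the vector types are sketched as plain Pi types.
abbrev MeaningVec (dim : ℕ) : Type := Fin dim → ℝ  -- a row of the paper's S matrix
abbrev FormVec (dim : ℕ) : Type := Fin dim → ℝ     -- a row of the paper's C matrix

/-- A finite indexed collection of (meaning, form) usage events. -/
structure TrainingExperience (numEvents meaningDim formDim : ℕ) where
  meanings : Fin numEvents → MeaningVec meaningDim
  forms : Fin numEvents → FormVec formDim
```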
A frequency vector — paper notation Q (the diagonal entries
of the weight matrix in @cite{gahl-baayen-2024} appendix §A1.3).
One nonneg weight per usage event.
We use Fin m → ℝ (rather than Fin m → NNReal) for proof
convenience; nonnegativity is documented and asserted as a
hypothesis in theorems that need it. The cognitive theory choice
IS the choice of q:
- q ≡ 1 (uniform): type-based learning → EL
- q i = #occurrences i: frequency-informed → FIL
- q i = log(#occurrences i): log-frequency learning
- q i = exp(-decay · timeAgo i): recency-weighted

All instantiations of FrequencyVector correspond to different
cognitive commitments about which usage events count for learning;
a sketch of two such instantiations follows the declaration below.
Equations
- Theories.Processing.Lexical.Discriminative.FrequencyVector numEvents = (Fin numEvents → ℝ)
Instances For
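A hedged Lean sketch of two of the instantiations listed above. The names tokenCount, decay and timeAgo are hypothetical placeholders introduced here for illustration, not declarations of this file, and FrequencyVector is re-abbreviated locally so the sketch is self-contained.

```lean
import Mathlib

-- Hypothetical sketch; `tokenCount`, `decay` and `timeAgo` are illustrative
-- assumptions, and `FrequencyVector` is abbreviated locally for self-containment.
abbrev FrequencyVector (m : ℕ) : Type := Fin m → ℝ

variable {m : ℕ} (tokenCount : Fin m → ℕ) (decay : ℝ) (timeAgo : Fin m → ℝ)

-- Frequency-informed weighting (FIL): each event counts as often as it was attested.
example : FrequencyVector m := fun i => (tokenCount i : ℝ)

-- Recency weighting: exponentially discount older events.
example : FrequencyVector m := fun i => Real.exp (-(decay * timeAgo i))
```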
The constant-1 frequency vector — type-uniform weighting; every
event counts once regardless of frequency. Endstate learning
operates with this q.
Equations
Instances For
Total mass of a frequency vector — the sum of all event weights. Cognitive interpretation: the total accumulated experience the learner has been exposed to.
Equations
- q.totalMass = ∑ i : Fin m, q i
Instances For
Normalise a frequency vector so its weights sum to 1. The result
represents the empirical distribution over events. Note this
is a FrequencyVector (still typed as Fin m → ℝ); for the
actual PMF cast use PMF.ofRealWeightFn from
Core.Probability.Constructions.
Instances For
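The defining equation is not shown above; presumably it divides each weight by the total mass, along the lines of q.normalize i = q i / q.totalMass, so that the normalised weights sum to 1 whenever q.totalMass ≠ 0. This is a hedged reading of the docstring, not the rendered source.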
Squared coordinate-distance between two form vectors:
Σⱼ (a j - b j)². Direct formula, no normed-space machinery
required. Distinct from the bare-Pi sup-norm in Normed.lean;
here we want the L² (Frobenius) inner product structure on
Fin n → ℝ that the paper's quadratic loss uses.
Equations
- Theories.Processing.Lexical.Discriminative.squaredDist a b = ∑ j : Fin n, (a j - b j) ^ 2
Instances For
The frequency-weighted training loss for a candidate
production map G:
weightedLoss data q G = Σᵢ qᵢ · ‖G(meaningsᵢ) − formsᵢ‖²
where ‖·‖² is squared coordinate-distance (= squared Frobenius
norm on the form-vector slot).
This is the paper's loss in its appendix §A1.3 form, before
being recast as the matrix expression ‖√Q (SG − C)‖²_F. The
cognitive commitment is at the level of this loss function:
the learner minimises the per-event squared mismatch between
the produced and observed form vectors, weighted by frequency
of occurrence.
Equations
- Theories.Processing.Lexical.Discriminative.weightedLoss data q G = ∑ i : Fin m, q i * Theories.Processing.Lexical.Discriminative.squaredDist (G (data.meanings i)) (data.forms i)
Instances For
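To connect the per-event sum with the matrix form mentioned above (reading G(meaningsᵢ) as row i of SG under the paper's row-vector convention): with S the meaning matrix, C the form matrix and Q = diag(q), row i of SG − C is G(meaningsᵢ) − formsᵢ, and premultiplying by √Q rescales that row by √qᵢ, so

‖√Q (SG − C)‖²_F = Σᵢ qᵢ · ‖G(meaningsᵢ) − formsᵢ‖² = weightedLoss data q G.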
A linear map G is an empirical risk minimiser (ERM) for
the experience data under frequency vector q if no other
linear map achieves a smaller weighted loss.
The cognitive theory choice IS the choice of q. Different
theories of learning specify different q's; the optimisation
procedure (minimise the resulting weighted L² loss) is fixed
across them.
Equations
- One or more equations did not get rendered due to their size.
Instances For
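Since the equation above did not render, the intended reading (modulo how "linear map" is represented in the source, which is not visible here) is:

IsERMSolution data q G ↔ ∀ linear G', weightedLoss data q G ≤ weightedLoss data q G'

i.e. G achieves a weighted loss no larger than that of any competing linear map on the same experience.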
An endstate-learning solution for the experience: an ERM solution under type-uniform weights. Paper appendix §A1.1 of @cite{gahl-baayen-2024}.
Equations
Instances For
A frequency-informed learning solution — abbreviation for
ERM under arbitrary q. The cognitive interpretation that q
is "token frequencies of usage events" lives in the choice of
q the consumer passes; the substrate is agnostic about which
q is empirically correct.
Paper appendix §A1.3 of @cite{gahl-baayen-2024} introduces FIL
with q = corpus token frequencies; future cognitive theories
may motivate other q choices.
Equations
Instances For
T-Loss-linearity: the weighted loss is linear in the
frequency vector. Direct algebraic fact; basis for T-Rescaling.
T-Loss-add-distrib: the weighted loss decomposes additively over the frequency vector. Used downstream to formalise mixture decompositions of cognitive theories.
T-Rescaling: ERM solutions are invariant under positive rescaling of the frequency vector. Relative frequencies matter; absolute scale doesn't.
Cognitive content: a learner who has experienced word A 100
times and word B 50 times has the same trained lexicon as one
who experienced A 200 times and B 100 times. The total amount of
accumulated experience (= overall scale of q) doesn't affect the
trained end-state — only the relative ratios.
Proved via weightedLoss_smul_frequency: scaling q by c > 0
scales the loss by c; argmins are invariant under positive
scalar multiples.
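In symbols, for c > 0:

weightedLoss data (c • q) G = Σᵢ (c · qᵢ) · ‖G(meaningsᵢ) − formsᵢ‖² = c · weightedLoss data q G

so weightedLoss data (c • q) G ≤ weightedLoss data (c • q) G' holds exactly when weightedLoss data q G ≤ weightedLoss data q G', and the two ERM problems have the same solutions.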
T-Support (weak form): if q i = 0, event i contributes
nothing to the weighted loss.
Cognitive content: events that the learner has not experienced
(q i = 0) cannot update the trained lexicon. Novel words and
pseudowords don't retroactively modify the production map; they
can only be processed by the existing D.production (which is
polymorphic over the entire meaning space, hence applies to any
meaning vector — but the trained map's coefficients reflect
only the support of q).
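Equivalently, the sum defining the loss may be restricted to the support of q:

weightedLoss data q G = Σ_{i | q i ≠ 0} qᵢ · ‖G(meaningsᵢ) − formsᵢ‖²

since every term with q i = 0 vanishes.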
Definitional equivalence: an IsELSolution is exactly an ERM
solution under uniform weights. Paper-canonical: paper §3.1 of
@cite{gahl-baayen-2024} introduces EL as the special case of FIL
with Q = I. Here uniformFrequency = fun _ => 1 is the discrete
analogue.
A LinearDiscriminativeLexicon's production map is the
trained production map for given training data and
frequency weights iff the production map is an ERM solution.
The substrate is agnostic about which q is empirically
correct; this predicate just records the relationship between
a DLM's production map and a particular cognitive theory's
training data.
Equations
- D.IsTrainedOn data q = Theories.Processing.Lexical.Discriminative.IsERMSolution data q D.production
Instances For
A DLM is EL-trained for given data iff its production map is the type-uniform ERM solution.
Equations
- D.IsELTrainedOn data = D.IsTrainedOn data (Theories.Processing.Lexical.Discriminative.uniformFrequency m)
Instances For
A DLM is FIL-trained with a given frequency vector iff its production map is the corresponding ERM solution.
Equations
- D.IsFILTrainedOn data q = D.IsTrainedOn data q