DLM training: endstate vs frequency-informed learning #
What counts as a learned production map: the optimization characterization
of EndState Learning (EL) and Frequency-Informed Learning (FIL). The
cognitive theory choice IS the choice of frequency weights q; the
optimization — minimise the weighted squared loss — is fixed across
theories. FrequencyVector is [GB24]'s diagonal Q
(appendix §A1.3); [HCB26] (ch. 6) introduces the
same device with max-normalised entries applied as √P premultiplication
of both S and C — ermSolution_iff_rescaled connects the two forms.
Main declarations #
TrainingExperience,FrequencyVector,weightedLoss,IsERMSolution— data, weights, loss, and minimiser Prop;IsELSolution/IsFILSolutionare the uniform and frequency-weighted cases.ermSolution_iff_rescaled— T-Rescaling: ERM solutions are invariant under positive rescaling ofq.weightedLoss_zero_event_drops— T-Support: unexperienced events cannot update the lexicon.coordResidual,weightedLoss_eq_sum_coordResidual— T-Separability: the loss is a sum of per-coordinate regression residuals.isERMSolution_iff_coordResidual— T-Coordinate-optimality: ERM = columnwise-unbeatable regression; no sign condition onq.IsERMSolution.coord_eq_of_decodable— T-Decodable-exact-fit: an exactly decodable form coordinate is reproduced exactly under positive weights; corollariesIsTrainedOn.semSup_eq_of_decodableandsemSupWord_eq_of_decodabletransfer this to the semantic- support measures — the architectural core of [STB25]'s inflectional-status finding.isERMSolution_of_interpolates— T-Interpolation: interpolating maps are ERM, soIsERMSolutionis nonvacuous on interpolable data.
Implementation notes #
This is not generic regression formalization: the loss function is the
cognitive commitment ([GB24] §3) and the frequency-weight
parameterisation is the cross-theory axis. squaredDist is the L²
kernel the quadratic loss uses, distinct from Normed.lean's sup-norm
carrier structure. The PMF view of q.normalize is a derived bridge
(PMF.ofRealWeightFn in Core.Probability.Constructions).
TODO #
- General existence
exists_isERMSolutionvia the normal equations / closed formG = (SᵀS)⁻¹SᵀC([HCB26] ch. 6) — a futureTraining/ClosedForm.lean. - Iterative Widrow-Hoff convergence to the FIL solution.
- Approximate-decodability gap bounds for
semSup(quantitative form of the Saito contrast).
Substrate types #
A training experience is a finite indexed collection of
(meaning, form) observation pairs. The paper's S matrix has
rows data.meanings i and C matrix has rows data.forms i,
indexed by event i : Fin m.
The cognitive interpretation: each i : Fin m is a "usage event"
— a single attestation of a (meaning, form) pair. Type-based
learning treats each unique pair as one event; frequency-informed
learning may have multiple events per pair (replication = weight,
deferred).
- meanings : Fin numEvents → MeaningVec meaningDim
- forms : Fin numEvents → FormVec formDim
Instances For
A frequency vector — paper notation Q (the diagonal entries
of the weight matrix in [GB24] appendix §A1.3).
One nonneg weight per usage event.
We use Fin m → ℝ (rather than Fin m → NNReal) for proof
convenience; nonnegativity is documented and asserted as a
hypothesis in theorems that need it. The cognitive theory choice
IS the choice of q:
q ≡ 1(uniform): type-based learning → ELq i = #occurrences i: frequency-informed → FILq i = log(#occurrences i): log-frequency learningq i = exp(-decay · timeAgo i): recency-weighted
All instantiations of FrequencyVector correspond to different
cognitive commitments about which usage events count for learning.
Equations
- Processing.Lexical.Discriminative.FrequencyVector numEvents = (Fin numEvents → ℝ)
Instances For
The constant-1 frequency vector — type-uniform weighting; every
event counts once regardless of frequency. Endstate learning
operates with this q.
Equations
Instances For
Total mass of a frequency vector — the sum of all event weights. Cognitive interpretation: the total accumulated experience the learner has been exposed to.
Equations
- q.totalMass = ∑ i : Fin m, q i
Instances For
Normalise a frequency vector so its weights sum to 1. The result
represents the empirical distribution over events. Note this
is a FrequencyVector (still typed as Fin m → ℝ); for the
actual PMF cast use PMF.ofRealWeightFn from
Core.Probability.Constructions.
Instances For
The weighted loss #
Squared coordinate-distance between two form vectors:
Σⱼ (a j - b j)². Direct formula, no normed-space machinery
required. Distinct from the bare-Pi sup-norm in Normed.lean;
here we want the L² (Frobenius) inner product structure on
Fin n → ℝ that the paper's quadratic loss uses.
Equations
- Processing.Lexical.Discriminative.squaredDist a b = ∑ j : Fin n, (a j - b j) ^ 2
Instances For
The frequency-weighted training loss for a candidate
production map G:
weightedLoss data q G = Σᵢ qᵢ · ‖G(meaningsᵢ) − formsᵢ‖²
where ‖·‖² is squared coordinate-distance (= squared Frobenius
norm on the form-vector slot).
This is the paper's loss in its appendix §A1.3 form, before
being recast as the matrix expression ‖√Q (SG − C)‖²_F. The
cognitive commitment is at the level of this loss function:
the learner minimises the per-event squared mismatch between
the produced and observed form vectors, weighted by frequency
of occurrence.
Equations
- Processing.Lexical.Discriminative.weightedLoss data q G = ∑ i : Fin m, q i * Processing.Lexical.Discriminative.squaredDist (G (data.meanings i)) (data.forms i)
Instances For
Solution Props: ERM, EL, FIL #
A linear map G is an empirical risk minimiser (ERM) for
the experience data under frequency vector q if no other
linear map achieves a smaller weighted loss.
The cognitive theory choice IS the choice of q. Different
theories of learning specify different q's; the optimisation
procedure (minimise the resulting weighted L² loss) is fixed
across them.
Equations
- One or more equations did not get rendered due to their size.
Instances For
An endstate-learning solution for the experience: an ERM solution under type-uniform weights. Paper appendix §A1.1 of [GB24].
Equations
Instances For
A frequency-informed learning solution — abbreviation for
ERM under arbitrary q. The cognitive interpretation that q
is "token frequencies of usage events" lives in the choice of
q the consumer passes; the substrate is agnostic about which
q is empirically correct.
Paper appendix §A1.3 of [GB24] introduces FIL
with q = corpus token frequencies; future cognitive theories
may motivate other q choices.
Equations
Instances For
Structural theorems #
T-Loss-linearity: the weighted loss is linear in the
frequency vector. Direct algebraic fact; basis for T-Rescaling.
T-Loss-add-distrib: the weighted loss decomposes additively over the frequency vector. Used downstream to formalise mixture decompositions of cognitive theories.
T-Rescaling: ERM solutions are invariant under positive rescaling of the frequency vector. Relative frequencies matter; absolute scale doesn't.
Cognitive content: a learner who has experienced word A 100
times and word B 50 times has the same trained lexicon as one
who experienced A 200 times and B 100 times. Time of accumulated
experience (= overall scale of q) doesn't affect the trained
end-state — only the relative ratios.
Proved via weightedLoss_smul_frequency: scaling q by c > 0
scales the loss by c; argmins are invariant under positive
scalar multiples.
T-Support (weak form): if q i = 0, event i contributes
nothing to the weighted loss.
Cognitive content: events that the learner has not experienced
(q i = 0) cannot update the trained lexicon. Novel words and
pseudowords don't retroactively modify the production map; they
can only be processed by the existing D.production (which is
polymorphic over the entire meaning space, hence applies to any
meaning vector — but the trained map's coefficients reflect
only the support of q).
Definitional equivalence: an IsELSolution is exactly an ERM
solution under uniform weights. Paper-canonical: paper §3.1 of
[GB24] introduces EL as the special case of FIL
with Q = I. Here uniformFrequency = fun _ => 1 is the discrete
analogue.
Per-coordinate residual optimality #
The loss separates across form coordinates: each column of G
regresses one cue's support on the semantic dimensions, "optimal in
the least-squares sense" ([HCB26] ch. 6). So
ERM is exactly columnwise unbeatability, and a coordinate that some
functional decodes exactly is reproduced exactly.
Weighted squared residual of a predicted column pred against
form coordinate j₀ of the training data.
Equations
- Processing.Lexical.Discriminative.coordResidual data q pred j₀ = ∑ k : Fin m, q k * (pred k - data.forms k j₀) ^ 2
Instances For
T-Separability: the weighted loss is the sum over form coordinates of the per-coordinate residuals — each column of the production map is an independent regression.
T-Coordinate-optimality: G is an ERM solution iff at every form
coordinate its column's residual is at most that of any linear
functional of the meanings — ERM is columnwise-unbeatable regression.
No sign condition on q in either direction.
T-Decodable-exact-fit: if some linear functional w of the meanings
exactly reproduces coordinate j₀ of the observed forms, then any ERM
solution under positive weights also reproduces it exactly on the
training events. Zero-residual case of isERMSolution_iff_coordResidual.
T-Interpolation: a map that reproduces every training form exactly
is an ERM solution under any nonnegative weights — IsERMSolution is
nonvacuous whenever the data are linearly interpolable.
Connection to LinearDiscriminativeLexicon #
A LinearDiscriminativeLexicon's production map is the
trained production map for given training data and
frequency weights iff the production map is an ERM solution.
The substrate is agnostic about which q is empirically
correct; this predicate just records the relationship between
a DLM's production map and a particular cognitive theory's
training data.
Equations
- D.IsTrainedOn data q = Processing.Lexical.Discriminative.IsERMSolution data q D.production
Instances For
A DLM is EL-trained for given data iff its production map is the type-uniform ERM solution.
Equations
- D.IsELTrainedOn data = D.IsTrainedOn data (Processing.Lexical.Discriminative.uniformFrequency m)
Instances For
A DLM is FIL-trained with a given frequency vector iff its production map is the corresponding ERM solution.
Equations
- D.IsFILTrainedOn data q = D.IsTrainedOn data q
Instances For
A trained DLM's semantic support at a linearly decodable form coordinate
equals the observed form value on every positively-weighted training
event; any contrast carried by that coordinate — categorical or graded —
transfers to semSup exactly.
[HCB26]'s Semantic Support for Form — semSupWord
over a word's cue coordinates — equals the sum of the observed form values
whenever each coordinate is linearly decodable from the meanings.