Documentation

Linglib.Processing.Lexical.Discriminative.Training

DLM training: endstate vs frequency-informed learning #

[BCSBB19] [GB24] [HCB26]

What counts as a learned production map: the optimization characterization of EndState Learning (EL) and Frequency-Informed Learning (FIL). The cognitive theory choice IS the choice of frequency weights q; the optimization — minimise the weighted squared loss — is fixed across theories. FrequencyVector is [GB24]'s diagonal Q (appendix §A1.3); [HCB26] (ch. 6) introduces the same device with max-normalised entries applied as √P premultiplication of both S and C — ermSolution_iff_rescaled connects the two forms.

Main declarations #

TrainingExperience, FrequencyVector, weightedLoss, IsERMSolution — data, weights, loss, and minimiser Prop; IsELSolution / IsFILSolution are the uniform and frequency-weighted cases.
ermSolution_iff_rescaled — T-Rescaling: ERM solutions are invariant under positive rescaling of q.
weightedLoss_zero_event_drops — T-Support: unexperienced events cannot update the lexicon.
coordResidual, weightedLoss_eq_sum_coordResidual — T-Separability: the loss is a sum of per-coordinate regression residuals.
isERMSolution_iff_coordResidual — T-Coordinate-optimality: ERM = columnwise-unbeatable regression; no sign condition on q.
IsERMSolution.coord_eq_of_decodable — T-Decodable-exact-fit: an exactly decodable form coordinate is reproduced exactly under positive weights; corollaries IsTrainedOn.semSup_eq_of_decodable and semSupWord_eq_of_decodable transfer this to the semantic- support measures — the architectural core of [STB25]'s inflectional-status finding.
isERMSolution_of_interpolates — T-Interpolation: interpolating maps are ERM, so IsERMSolution is nonvacuous on interpolable data.

Implementation notes #

This is not generic regression formalization: the loss function is the cognitive commitment ([GB24] §3) and the frequency-weight parameterisation is the cross-theory axis. squaredDist is the L² kernel the quadratic loss uses, distinct from Normed.lean's sup-norm carrier structure. The PMF view of q.normalize is a derived bridge (PMF.ofRealWeightFn in Core.Probability.Constructions).

TODO #

General existence exists_isERMSolution via the normal equations / closed form G = (SᵀS)⁻¹SᵀC ([HCB26] ch. 6) — a future Training/ClosedForm.lean.
Iterative Widrow-Hoff convergence to the FIL solution.
Approximate-decodability gap bounds for semSup (quantitative form of the Saito contrast).

Substrate types #

structure Processing.Lexical.Discriminative.TrainingExperience (numEvents formDim meaningDim : ℕ) :

A training experience is a finite indexed collection of (meaning, form) observation pairs. The paper's S matrix has rows data.meanings i and C matrix has rows data.forms i, indexed by event i : Fin m.

The cognitive interpretation: each i : Fin m is a "usage event" — a single attestation of a (meaning, form) pair. Type-based learning treats each unique pair as one event; frequency-informed learning may have multiple events per pair (replication = weight, deferred).

meanings : Fin numEvents → MeaningVec meaningDim
forms : Fin numEvents → FormVec formDim

Instances For

@[reducible, inline]

abbrev Processing.Lexical.Discriminative.FrequencyVector (numEvents : ℕ) :

A frequency vector — paper notation Q (the diagonal entries of the weight matrix in [GB24] appendix §A1.3). One nonneg weight per usage event.

We use Fin m → ℝ (rather than Fin m → NNReal) for proof convenience; nonnegativity is documented and asserted as a hypothesis in theorems that need it. The cognitive theory choice IS the choice of q:

q ≡ 1 (uniform): type-based learning → EL
q i = #occurrences i: frequency-informed → FIL
q i = log(#occurrences i): log-frequency learning
q i = exp(-decay · timeAgo i): recency-weighted

All instantiations of FrequencyVector correspond to different cognitive commitments about which usage events count for learning.

Equations

Processing.Lexical.Discriminative.FrequencyVector numEvents = (Fin numEvents → ℝ)

Instances For

def Processing.Lexical.Discriminative.uniformFrequency (m : ℕ) :

FrequencyVector m

The constant-1 frequency vector — type-uniform weighting; every event counts once regardless of frequency. Endstate learning operates with this q.

Equations

Processing.Lexical.Discriminative.uniformFrequency m x✝ = 1

Instances For

def Processing.Lexical.Discriminative.FrequencyVector.totalMass {m : ℕ} (q : FrequencyVector m) :

ℝ

Total mass of a frequency vector — the sum of all event weights. Cognitive interpretation: the total accumulated experience the learner has been exposed to.

Equations

q.totalMass = ∑ i : Fin m, q i

Instances For

noncomputable def Processing.Lexical.Discriminative.FrequencyVector.normalize {m : ℕ} (q : FrequencyVector m) :

FrequencyVector m

Normalise a frequency vector so its weights sum to 1. The result represents the empirical distribution over events. Note this is a FrequencyVector (still typed as Fin m → ℝ); for the actual PMF cast use PMF.ofRealWeightFn from Core.Probability.Constructions.

Equations

q.normalize i = q i / q.totalMass

Instances For

@[simp]

theorem Processing.Lexical.Discriminative.FrequencyVector.normalize_apply {m : ℕ} (q : FrequencyVector m) (i : Fin m) :

q.normalize i = q i / q.totalMass

The weighted loss #

def Processing.Lexical.Discriminative.squaredDist {n : ℕ} (a b : FormVec n) :

ℝ

Squared coordinate-distance between two form vectors: Σⱼ (a j - b j)². Direct formula, no normed-space machinery required. Distinct from the bare-Pi sup-norm in Normed.lean; here we want the L² (Frobenius) inner product structure on Fin n → ℝ that the paper's quadratic loss uses.

Equations

Processing.Lexical.Discriminative.squaredDist a b = ∑ j : Fin n, (a j - b j) ^ 2

Instances For

theorem Processing.Lexical.Discriminative.squaredDist_self {n : ℕ} (a : FormVec n) :

squaredDist a a = 0

theorem Processing.Lexical.Discriminative.squaredDist_nonneg {n : ℕ} (a b : FormVec n) :

0 ≤ squaredDist a b

def Processing.Lexical.Discriminative.weightedLoss {m n d : ℕ} (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[ℝ] FormVec n) :

ℝ

The frequency-weighted training loss for a candidate production map G:

weightedLoss data q G = Σᵢ qᵢ · ‖G(meaningsᵢ) − formsᵢ‖²

where ‖·‖² is squared coordinate-distance (= squared Frobenius norm on the form-vector slot).

This is the paper's loss in its appendix §A1.3 form, before being recast as the matrix expression ‖√Q (SG − C)‖²_F. The cognitive commitment is at the level of this loss function: the learner minimises the per-event squared mismatch between the produced and observed form vectors, weighted by frequency of occurrence.

Equations

Processing.Lexical.Discriminative.weightedLoss data q G = ∑ i : Fin m, q i * Processing.Lexical.Discriminative.squaredDist (G (data.meanings i)) (data.forms i)

Instances For

theorem Processing.Lexical.Discriminative.weightedLoss_nonneg {m n d : ℕ} (data : TrainingExperience m n d) {q : FrequencyVector m} (hq : ∀ (i : Fin m), 0 ≤ q i) (G : MeaningVec d →ₗ[ℝ] FormVec n) :

0 ≤ weightedLoss data q G

Solution Props: ERM, EL, FIL #

def Processing.Lexical.Discriminative.IsERMSolution {m n d : ℕ} (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[ℝ] FormVec n) :

A linear map G is an empirical risk minimiser (ERM) for the experience data under frequency vector q if no other linear map achieves a smaller weighted loss.

The cognitive theory choice IS the choice of q. Different theories of learning specify different q's; the optimisation procedure (minimise the resulting weighted L² loss) is fixed across them.

Equations

One or more equations did not get rendered due to their size.

Instances For

@[reducible, inline]

abbrev Processing.Lexical.Discriminative.IsELSolution {m n d : ℕ} (data : TrainingExperience m n d) (G : MeaningVec d →ₗ[ℝ] FormVec n) :

An endstate-learning solution for the experience: an ERM solution under type-uniform weights. Paper appendix §A1.1 of [GB24].

Equations

Processing.Lexical.Discriminative.IsELSolution data G = Processing.Lexical.Discriminative.IsERMSolution data (Processing.Lexical.Discriminative.uniformFrequency m) G

Instances For

@[reducible, inline]

abbrev Processing.Lexical.Discriminative.IsFILSolution {m n d : ℕ} (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[ℝ] FormVec n) :

A frequency-informed learning solution — abbreviation for ERM under arbitrary q. The cognitive interpretation that q is "token frequencies of usage events" lives in the choice of q the consumer passes; the substrate is agnostic about which q is empirically correct.

Paper appendix §A1.3 of [GB24] introduces FIL with q = corpus token frequencies; future cognitive theories may motivate other q choices.

Equations

Processing.Lexical.Discriminative.IsFILSolution data q G = Processing.Lexical.Discriminative.IsERMSolution data q G

Instances For

Structural theorems #

theorem Processing.Lexical.Discriminative.weightedLoss_smul_frequency {m n d : ℕ} (data : TrainingExperience m n d) (q : FrequencyVector m) (c : ℝ) (G : MeaningVec d →ₗ[ℝ] FormVec n) :

weightedLoss data (c • q) G = c * weightedLoss data q G

T-Loss-linearity: the weighted loss is linear in the frequency vector. Direct algebraic fact; basis for T-Rescaling.

theorem Processing.Lexical.Discriminative.weightedLoss_add_frequency {m n d : ℕ} (data : TrainingExperience m n d) (q₁ q₂ : FrequencyVector m) (G : MeaningVec d →ₗ[ℝ] FormVec n) :

weightedLoss data (q₁ + q₂) G = weightedLoss data q₁ G + weightedLoss data q₂ G

T-Loss-add-distrib: the weighted loss decomposes additively over the frequency vector. Used downstream to formalise mixture decompositions of cognitive theories.

theorem Processing.Lexical.Discriminative.ermSolution_iff_rescaled {m n d : ℕ} (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[ℝ] FormVec n) {c : ℝ} (hc : 0 < c) :

IsERMSolution data (c • q) G ↔ IsERMSolution data q G

T-Rescaling: ERM solutions are invariant under positive rescaling of the frequency vector. Relative frequencies matter; absolute scale doesn't.

Cognitive content: a learner who has experienced word A 100 times and word B 50 times has the same trained lexicon as one who experienced A 200 times and B 100 times. Time of accumulated experience (= overall scale of q) doesn't affect the trained end-state — only the relative ratios.

Proved via weightedLoss_smul_frequency: scaling q by c > 0 scales the loss by c; argmins are invariant under positive scalar multiples.

theorem Processing.Lexical.Discriminative.weightedLoss_zero_event_drops {m n d : ℕ} (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[ℝ] FormVec n) (i₀ : Fin m) (h : q i₀ = 0) :

weightedLoss data q G = weightedLoss data (Function.update q i₀ 0) G

T-Support (weak form): if q i = 0, event i contributes nothing to the weighted loss.

Cognitive content: events that the learner has not experienced (q i = 0) cannot update the trained lexicon. Novel words and pseudowords don't retroactively modify the production map; they can only be processed by the existing D.production (which is polymorphic over the entire meaning space, hence applies to any meaning vector — but the trained map's coefficients reflect only the support of q).

theorem Processing.Lexical.Discriminative.isELSolution_eq_isERM_uniform {m n d : ℕ} (data : TrainingExperience m n d) (G : MeaningVec d →ₗ[ℝ] FormVec n) :

IsELSolution data G ↔ IsERMSolution data (uniformFrequency m) G

Definitional equivalence: an IsELSolution is exactly an ERM solution under uniform weights. Paper-canonical: paper §3.1 of [GB24] introduces EL as the special case of FIL with Q = I. Here uniformFrequency = fun _ => 1 is the discrete analogue.

Per-coordinate residual optimality #

The loss separates across form coordinates: each column of G regresses one cue's support on the semantic dimensions, "optimal in the least-squares sense" ([HCB26] ch. 6). So ERM is exactly columnwise unbeatability, and a coordinate that some functional decodes exactly is reproduced exactly.

def Processing.Lexical.Discriminative.coordResidual {m n d : ℕ} (data : TrainingExperience m n d) (q : FrequencyVector m) (pred : Fin m → ℝ) (j₀ : Fin n) :

ℝ

Weighted squared residual of a predicted column pred against form coordinate j₀ of the training data.

Equations

Processing.Lexical.Discriminative.coordResidual data q pred j₀ = ∑ k : Fin m, q k * (pred k - data.forms k j₀) ^ 2

Instances For

theorem Processing.Lexical.Discriminative.coordResidual_nonneg {m n d : ℕ} (data : TrainingExperience m n d) {q : FrequencyVector m} (hq : ∀ (i : Fin m), 0 ≤ q i) (pred : Fin m → ℝ) (j₀ : Fin n) :

0 ≤ coordResidual data q pred j₀

theorem Processing.Lexical.Discriminative.weightedLoss_eq_sum_coordResidual {m n d : ℕ} (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[ℝ] FormVec n) :

weightedLoss data q G = ∑ j : Fin n, coordResidual data q (fun (k : Fin m) => G (data.meanings k) j) j

T-Separability: the weighted loss is the sum over form coordinates of the per-coordinate residuals — each column of the production map is an independent regression.

theorem Processing.Lexical.Discriminative.isERMSolution_iff_coordResidual {m n d : ℕ} (data : TrainingExperience m n d) (q : FrequencyVector m) (G : MeaningVec d →ₗ[ℝ] FormVec n) :

IsERMSolution data q G ↔ ∀ (j₀ : Fin n) (w : MeaningVec d →ₗ[ℝ] ℝ), coordResidual data q (fun (k : Fin m) => G (data.meanings k) j₀) j₀ ≤ coordResidual data q (fun (k : Fin m) => w (data.meanings k)) j₀

T-Coordinate-optimality: G is an ERM solution iff at every form coordinate its column's residual is at most that of any linear functional of the meanings — ERM is columnwise-unbeatable regression. No sign condition on q in either direction.

theorem Processing.Lexical.Discriminative.IsERMSolution.coord_eq_of_decodable {m n d : ℕ} {data : TrainingExperience m n d} {q : FrequencyVector m} (hq : ∀ (i : Fin m), 0 < q i) {G : MeaningVec d →ₗ[ℝ] FormVec n} (hG : IsERMSolution data q G) {j₀ : Fin n} {w : MeaningVec d →ₗ[ℝ] ℝ} (hw : ∀ (i : Fin m), w (data.meanings i) = data.forms i j₀) (i : Fin m) :

G (data.meanings i) j₀ = data.forms i j₀

T-Decodable-exact-fit: if some linear functional w of the meanings exactly reproduces coordinate j₀ of the observed forms, then any ERM solution under positive weights also reproduces it exactly on the training events. Zero-residual case of isERMSolution_iff_coordResidual.

theorem Processing.Lexical.Discriminative.isERMSolution_of_interpolates {m n d : ℕ} {data : TrainingExperience m n d} {q : FrequencyVector m} (hq : ∀ (i : Fin m), 0 ≤ q i) {G : MeaningVec d →ₗ[ℝ] FormVec n} (hG : ∀ (i : Fin m), G (data.meanings i) = data.forms i) :

IsERMSolution data q G

T-Interpolation: a map that reproduces every training form exactly is an ERM solution under any nonnegative weights — IsERMSolution is nonvacuous whenever the data are linearly interpolable.

Connection to `LinearDiscriminativeLexicon` #

def Processing.Lexical.Discriminative.LinearDiscriminativeLexicon.IsTrainedOn {m n d : ℕ} (D : LinearDiscriminativeLexicon ℝ (FormVec n) (MeaningVec d)) (data : TrainingExperience m n d) (q : FrequencyVector m) :

A LinearDiscriminativeLexicon's production map is the trained production map for given training data and frequency weights iff the production map is an ERM solution.

The substrate is agnostic about which q is empirically correct; this predicate just records the relationship between a DLM's production map and a particular cognitive theory's training data.

Equations

D.IsTrainedOn data q = Processing.Lexical.Discriminative.IsERMSolution data q D.production

Instances For

@[reducible, inline]

abbrev Processing.Lexical.Discriminative.LinearDiscriminativeLexicon.IsELTrainedOn {m n d : ℕ} (D : LinearDiscriminativeLexicon ℝ (FormVec n) (MeaningVec d)) (data : TrainingExperience m n d) :

A DLM is EL-trained for given data iff its production map is the type-uniform ERM solution.

Equations

D.IsELTrainedOn data = D.IsTrainedOn data (Processing.Lexical.Discriminative.uniformFrequency m)

Instances For

@[reducible, inline]

abbrev Processing.Lexical.Discriminative.LinearDiscriminativeLexicon.IsFILTrainedOn {m n d : ℕ} (D : LinearDiscriminativeLexicon ℝ (FormVec n) (MeaningVec d)) (data : TrainingExperience m n d) (q : FrequencyVector m) :

A DLM is FIL-trained with a given frequency vector iff its production map is the corresponding ERM solution.

Equations

D.IsFILTrainedOn data q = D.IsTrainedOn data q

Instances For

theorem Processing.Lexical.Discriminative.LinearDiscriminativeLexicon.IsTrainedOn.semSup_eq_of_decodable {m n d : ℕ} {D : LinearDiscriminativeLexicon ℝ (FormVec n) (MeaningVec d)} {data : TrainingExperience m n d} {q : FrequencyVector m} (hD : D.IsTrainedOn data q) (hq : ∀ (i : Fin m), 0 < q i) {j₀ : Fin n} {w : MeaningVec d →ₗ[ℝ] ℝ} (hw : ∀ (i : Fin m), w (data.meanings i) = data.forms i j₀) (i : Fin m) :

semSup D (data.meanings i) j₀ = data.forms i j₀

A trained DLM's semantic support at a linearly decodable form coordinate equals the observed form value on every positively-weighted training event; any contrast carried by that coordinate — categorical or graded — transfers to semSup exactly.

theorem Processing.Lexical.Discriminative.LinearDiscriminativeLexicon.IsTrainedOn.semSupWord_eq_of_decodable {m n d : ℕ} {D : LinearDiscriminativeLexicon ℝ (FormVec n) (MeaningVec d)} {data : TrainingExperience m n d} {q : FrequencyVector m} (hD : D.IsTrainedOn data q) (hq : ∀ (i : Fin m), 0 < q i) {js : List (Fin n)} (hw : ∀ j ∈ js, ∃ (w : MeaningVec d →ₗ[ℝ] ℝ), ∀ (i : Fin m), w (data.meanings i) = data.forms i j) (i : Fin m) :

semSupWord D (data.meanings i) js = (List.map (fun (j : Fin n) => data.forms i j) js).sum

[HCB26]'s Semantic Support for Form — semSupWord over a word's cue coordinates — equals the sum of the observed form values whenever each coordinate is linearly decodable from the meanings.