Linglib.Studies.ODonnell2015

O'Donnell 2015: English derivational morphology #

The first study file to use the FG-family substrate from Morphology/FragmentGrammars/. It demonstrates the API on the central empirical contrast of [OD15] Chapter 7 (Fig 7.3, p. 262): the highly productive English nominaliser -ness versus the unproductive -ion and -ate.

Empirical content #

The load-bearing claim of the book's Chapter 7 is qualitative: -ness:Adj>N is productive; -ion:V>N and -ate:BND>V are not. In Fig 7.3 (p. 262), only the FG model places -ness among its top five productive suffixes; all four competing models (DMPCFG, MAG, DOP1, ENDOP) rank -ion first or second, and three of those (DMPCFG, DOP1, ENDOP) also wrongly elevate -ate (pp. 261–263). Table 7.1 (p. 265) adds that only FG correlates strongly with Baayen's hapax-based productivity estimators. Suffix.productivityIndex encodes the strict ordering ness > ion > ate; the ion > ate half is a tie-break (both are unproductive on novel forms, but -ate is structurally more restricted), not part of [OD15]'s central contrast.

Note that -ate is not a nominaliser — it is a verb-forming suffix that selects bound stems (e.g. segregate from bound segregat-). The toy grammar below reflects this: rAte produces V, not N, with a BND (bound-stem) nonterminal as its argument. The three suffixes are grouped here by being the central derivational contrast of [OD15] Ch 7, not by sharing an output category.

DMPCFG critique (Ch 7) #

The DMPCFG model bases its productivity inferences on the token frequency of suffixes ([OD15] Ch 7, p. 268). Per [OD15] Fig 7.4 (p. 267), -ion has roughly an order of magnitude more CELEX tokens than -ness, so a learned DMPCFG posterior places -ion above -ness in productivity — exactly the failure mode [OD15] uses to discriminate FG from DMPCFG. The pseudo-counts in dmpcfgFromObserved are stipulated to track the empirical productivity (via productivityIndex), not learned from a corpus. Two PMF-form theorems below (…_prior_lt and …_lt_of_count_gap) make the prior + flip dichotomy Lean-checkable.

References #

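[OD15] O'Donnell, Timothy J. (2015). Productivity and Reuse in Language: A Theory of Linguistic Computation and Storage. MIT Press. Ch 6–7.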

The three suffixes #

[OD15] Chapter 7's central productivity contrast (pp. 261–263). The terms "productive" and "unproductive" are pre-theoretic descriptions consistent with the literature the book reviews; the data below commits to nothing about why one suffix is productive and another is not.

inductive ODonnell2015.Suffix :

The three English derivational suffixes of the Chapter 7 contrast.

-ness (Adj>N): "perhaps the most commonly-discussed productive suffix in English" (p. 261); pine-scented → pine-scentedness.
-ion (V>N): high type and token frequency but unproductive on novel verbs — the competing models' "obviously absurd prediction" is that it attaches to arbitrary verbs, producing *meetion "a MEETING event" (pp. 261–262).
-ate (BND>V): a verb-forming suffix "restricted, by its categorial definition, from attaching to anything besides bound stems" (p. 263), e.g. segregate from bound segregat-.

ness : Suffix
ion : Suffix
ate : Suffix

@[implicit_reducible]

instance ODonnell2015.instDecidableEqSuffix :

DecidableEq Suffix

Equations

ODonnell2015.instDecidableEqSuffix x✝ y✝ = if h : x✝.ctorIdx = y✝.ctorIdx then isTrue ⋯ else isFalse ⋯

@[implicit_reducible]

instance ODonnell2015.instReprSuffix :

Repr Suffix

Equations

ODonnell2015.instReprSuffix = { reprPrec := ODonnell2015.instReprSuffix.repr }

def ODonnell2015.instReprSuffix.repr :

Suffix → ℕ → Std.Format

Equations

ODonnell2015.instReprSuffix.repr ODonnell2015.Suffix.ness prec✝ = Repr.addAppParen (Std.Format.nest (if prec✝ ≥ 1024 then 1 else 2) (Std.Format.text "ODonnell2015.Suffix.ness")).group prec✝
ODonnell2015.instReprSuffix.repr ODonnell2015.Suffix.ion prec✝ = Repr.addAppParen (Std.Format.nest (if prec✝ ≥ 1024 then 1 else 2) (Std.Format.text "ODonnell2015.Suffix.ion")).group prec✝
ODonnell2015.instReprSuffix.repr ODonnell2015.Suffix.ate prec✝ = Repr.addAppParen (Std.Format.nest (if prec✝ ≥ 1024 then 1 else 2) (Std.Format.text "ODonnell2015.Suffix.ate")).group prec✝

def ODonnell2015.Suffix.productivityIndex :

Suffix → ℕ

A pre-theoretic productivity index for the three suffixes — higher is more productive. Coding ness > ion > ate reproduces the ordering implied by [OD15] Chapter 7 (Fig 7.3 and the §7.3.1.1 discussion). The ion > ate direction is a tie-break: both are unproductive on novel forms, but ate is structurally more restricted (bound stems only), so we rank it strictly lower.
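
The equations for productivityIndex are not rendered on this page. The following is a minimal sketch of what the definition might look like, assuming the values implied by the pseudo-count description elsewhere in this file (productivityIndex + 1 yields 3, 2, 1 for ness, ion, ate). The namespace ProductivityIndexSketch is hypothetical and the numeric values are an inference, not a quotation of the source.

```lean
namespace ProductivityIndexSketch  -- hypothetical namespace, not part of Linglib

inductive Suffix where
  | ness | ion | ate
  deriving DecidableEq, Repr

/-- Assumed values, inferred from `pseudoVal = productivityIndex + 1`. -/
def Suffix.productivityIndex : Suffix → Nat
  | .ness => 2
  | .ion  => 1
  | .ate  => 0

-- The strict ordering ness > ion > ate, checked by evaluation.
example : Suffix.ate.productivityIndex < Suffix.ion.productivityIndex := by decide
example : Suffix.ion.productivityIndex < Suffix.ness.productivityIndex := by decide

end ProductivityIndexSketch
```

Only the relative order matters for the theorems in this file; any strictly decreasing assignment along ness, ion, ate would serve the same purpose.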

def ODonnell2015.moreProductiveThan (a b : Suffix) :

Prop

The pre-theoretic strict ordering on the three suffixes by productivity. Any theory of productivity that purports to account for the [OD15] Chapter 7 data must reproduce this ordering; failure to do so falsifies the theory against the data (this is exactly the discriminator deployed against DMPCFG / MAG / DOP1 / ENDOP in Fig 7.3, all of which place -ion in their top 5).

Equations

ODonnell2015.moreProductiveThan a b = (a.productivityIndex > b.productivityIndex)

@[implicit_reducible]

instance ODonnell2015.instDecidableRelSuffixMoreProductiveThan :

DecidableRel moreProductiveThan

Equations

ODonnell2015.instDecidableRelSuffixMoreProductiveThan a b = ODonnell2015.instDecidableRelSuffixMoreProductiveThan._aux_1 a b
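
Because moreProductiveThan unfolds to a comparison of natural numbers, individual instances can be settled by evaluation. The sketch below re-declares the data layer in a hypothetical namespace (OrderingSketch) with assumed index values 2, 1, 0, and shows one way the DecidableRel instance could be supplied; the actual instance in the source is generated through an auxiliary definition that is not shown here.

```lean
namespace OrderingSketch  -- hypothetical namespace, not part of Linglib

inductive Suffix where
  | ness | ion | ate
  deriving DecidableEq, Repr

-- Assumed values; only the ordering matters.
def Suffix.productivityIndex : Suffix → Nat
  | .ness => 2
  | .ion  => 1
  | .ate  => 0

def moreProductiveThan (a b : Suffix) : Prop :=
  a.productivityIndex > b.productivityIndex

-- Decidability is inherited from `<` on `Nat`.
instance : DecidableRel moreProductiveThan :=
  fun a b => inferInstanceAs (Decidable (a.productivityIndex > b.productivityIndex))

example : moreProductiveThan .ness .ion := by decide
example : moreProductiveThan .ion .ate := by decide
example : ¬ moreProductiveThan .ion .ness := by decide

end OrderingSketch
```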

Frequency-spectrum statistics (Fig 7.4, pp. 267–268) #

The book's distributional evidence: -ness has the "large number of rare events" (LNRE) shape characteristic of a productive process — a spectrum "sharply peaked at low-frequency forms"; -ion's spectrum has few hapaxes and spreads its mass through higher frequency ranges (cf. §1.2.6 and Fig 1.1 on unproductive -ity/-th). Fig 7.4 reports spectra for -ness and -ion only; the book gives no spectrum for -ate, whose unproductivity is categorial (bound stems only).

structure ODonnell2015.SpectrumStats :

Corpus statistics for a suffix in the Chapter 7 training corpus (CELEX-derived): word types, word tokens, hapax legomena.

wordTypes : ℕ
wordTokens : ℕ
hapaxes : ℕ

def ODonnell2015.instDecidableEqSpectrumStats.decEq (x✝ x✝¹ : SpectrumStats) :

Decidable (x✝ = x✝¹)

Equations

One or more equations did not get rendered due to their size.

@[implicit_reducible]

instance ODonnell2015.instDecidableEqSpectrumStats :

DecidableEq SpectrumStats

Equations

ODonnell2015.instDecidableEqSpectrumStats = ODonnell2015.instDecidableEqSpectrumStats.decEq

@[implicit_reducible]

instance ODonnell2015.instReprSpectrumStats :

Repr SpectrumStats

Equations

ODonnell2015.instReprSpectrumStats = { reprPrec := ODonnell2015.instReprSpectrumStats.repr }

def ODonnell2015.instReprSpectrumStats.repr :

SpectrumStats → ℕ → Std.Format

Equations

One or more equations did not get rendered due to their size.

def ODonnell2015.nessStats :

SpectrumStats

-ness: 1024 word types, 15,568 tokens, 350 hapaxes ([OD15] pp. 267–268). LNRE-shaped: hapax-rich, spectrum peaked at frequency 1 (Fig 7.4, left).

Equations

ODonnell2015.nessStats = { wordTypes := 1024, wordTokens := 15568, hapaxes := 350 }

def ODonnell2015.ionStats :

SpectrumStats

-ion: 1117 word types, 162,573 tokens, 83 hapaxes ([OD15] pp. 267–268). Not LNRE-shaped: hapax-poor, mass spread toward higher frequencies (Fig 7.4, right).

Equations

ODonnell2015.ionStats = { wordTypes := 1117, wordTokens := 162573, hapaxes := 83 }

theorem ODonnell2015.ness_hapax_richer :

nessStats.hapaxes * ionStats.wordTypes > ionStats.hapaxes * nessStats.wordTypes

-ness is hapax-richer than -ion (350/1024 vs 83/1117) — the distributional fingerprint of productivity that Baayen's hapax-based estimators measure and that the FG model exploits (p. 268). Stated by cross-multiplication to stay in Nat.

theorem ODonnell2015.ness_higher_type_token_ratio :

nessStats.wordTypes * ionStats.wordTokens > ionStats.wordTypes * nessStats.wordTokens

-ness has the higher type–token ratio (1024/15,568 vs 1117/162,573): -ion's distribution is dominated by reuse of high-frequency existing words, not novel coinage.

theorem ODonnell2015.ion_token_frequency_dominates :

ionStats.wordTokens > 10 * nessStats.wordTokens

-ion has more than an order of magnitude more tokens than -ness — the token-frequency gap that misleads DMPCFG, which "bases productivity inferences purely on the token frequency of suffixes" (p. 268).
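
The three inequalities above are plain arithmetic on the reported counts. As a sketch, assuming nothing beyond core Lean 4, they can be restated and checked in a hypothetical namespace (SpectrumSketch); the intermediate products are given in comments. The proofs in the source may use a different tactic.

```lean
namespace SpectrumSketch  -- hypothetical namespace, not part of Linglib

structure SpectrumStats where
  wordTypes  : Nat
  wordTokens : Nat
  hapaxes    : Nat
  deriving DecidableEq, Repr

def nessStats : SpectrumStats :=
  { wordTypes := 1024, wordTokens := 15568, hapaxes := 350 }

def ionStats : SpectrumStats :=
  { wordTypes := 1117, wordTokens := 162573, hapaxes := 83 }

-- Hapax share: 350 * 1117 = 390950 > 83 * 1024 = 84992.
example : nessStats.hapaxes * ionStats.wordTypes
    > ionStats.hapaxes * nessStats.wordTypes := by decide

-- Type-token ratio: 1024 * 162573 = 166474752 > 1117 * 15568 = 17389456.
example : nessStats.wordTypes * ionStats.wordTokens
    > ionStats.wordTypes * nessStats.wordTokens := by decide

-- Token dominance: 162573 > 10 * 15568 = 155680.
example : ionStats.wordTokens > 10 * nessStats.wordTokens := by decide

end SpectrumSketch
```

Cross-multiplying keeps every statement in Nat, so no division or rational arithmetic is needed to compare the ratios.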

Toy CFG #

inductive ODonnell2015.Sym :

The six terminal symbols of the toy derivational grammar: sentinels adj/v/bnd for adjective, verb and bound-stem bases, plus the three derivational suffixes -ness, -ion, -ate.

adj : Sym
v : Sym
bnd : Sym
ness : Sym
ion : Sym
ate : Sym

@[implicit_reducible]

instance ODonnell2015.instDecidableEqSym :

DecidableEq Sym

Equations

ODonnell2015.instDecidableEqSym x✝ y✝ = if h : x✝.ctorIdx = y✝.ctorIdx then isTrue ⋯ else isFalse ⋯

def ODonnell2015.instReprSym.repr :

Sym → ℕ → Std.Format

Equations

ODonnell2015.instReprSym.repr ODonnell2015.Sym.adj prec✝ = Repr.addAppParen (Std.Format.nest (if prec✝ ≥ 1024 then 1 else 2) (Std.Format.text "ODonnell2015.Sym.adj")).group prec✝
ODonnell2015.instReprSym.repr ODonnell2015.Sym.v prec✝ = Repr.addAppParen (Std.Format.nest (if prec✝ ≥ 1024 then 1 else 2) (Std.Format.text "ODonnell2015.Sym.v")).group prec✝
ODonnell2015.instReprSym.repr ODonnell2015.Sym.bnd prec✝ = Repr.addAppParen (Std.Format.nest (if prec✝ ≥ 1024 then 1 else 2) (Std.Format.text "ODonnell2015.Sym.bnd")).group prec✝
ODonnell2015.instReprSym.repr ODonnell2015.Sym.ness prec✝ = Repr.addAppParen (Std.Format.nest (if prec✝ ≥ 1024 then 1 else 2) (Std.Format.text "ODonnell2015.Sym.ness")).group prec✝
ODonnell2015.instReprSym.repr ODonnell2015.Sym.ion prec✝ = Repr.addAppParen (Std.Format.nest (if prec✝ ≥ 1024 then 1 else 2) (Std.Format.text "ODonnell2015.Sym.ion")).group prec✝
ODonnell2015.instReprSym.repr ODonnell2015.Sym.ate prec✝ = Repr.addAppParen (Std.Format.nest (if prec✝ ≥ 1024 then 1 else 2) (Std.Format.text "ODonnell2015.Sym.ate")).group prec✝

@[implicit_reducible]

instance ODonnell2015.instReprSym :

Repr Sym

Equations

ODonnell2015.instReprSym = { reprPrec := ODonnell2015.instReprSym.repr }

inductive ODonnell2015.SuffixNT :

The four nonterminals of the toy derivational grammar. BND represents a bound stem — the selectional restriction of -ate (cf. segregat-, demonstrat-).

N : SuffixNT
A : SuffixNT
V : SuffixNT
BND : SuffixNT

@[implicit_reducible]

instance ODonnell2015.instDecidableEqSuffixNT :

DecidableEq SuffixNT

Equations

ODonnell2015.instDecidableEqSuffixNT x✝ y✝ = if h : x✝.ctorIdx = y✝.ctorIdx then isTrue ⋯ else isFalse ⋯

@[implicit_reducible]

instance ODonnell2015.instReprSuffixNT :

Repr SuffixNT

Equations

ODonnell2015.instReprSuffixNT = { reprPrec := ODonnell2015.instReprSuffixNT.repr }

def ODonnell2015.instReprSuffixNT.repr :

SuffixNT → ℕ → Std.Format

Equations

ODonnell2015.instReprSuffixNT.repr ODonnell2015.SuffixNT.N prec✝ = Repr.addAppParen (Std.Format.nest (if prec✝ ≥ 1024 then 1 else 2) (Std.Format.text "ODonnell2015.SuffixNT.N")).group prec✝
ODonnell2015.instReprSuffixNT.repr ODonnell2015.SuffixNT.A prec✝ = Repr.addAppParen (Std.Format.nest (if prec✝ ≥ 1024 then 1 else 2) (Std.Format.text "ODonnell2015.SuffixNT.A")).group prec✝
ODonnell2015.instReprSuffixNT.repr ODonnell2015.SuffixNT.V prec✝ = Repr.addAppParen (Std.Format.nest (if prec✝ ≥ 1024 then 1 else 2) (Std.Format.text "ODonnell2015.SuffixNT.V")).group prec✝
ODonnell2015.instReprSuffixNT.repr ODonnell2015.SuffixNT.BND prec✝ = Repr.addAppParen (Std.Format.nest (if prec✝ ≥ 1024 then 1 else 2) (Std.Format.text "ODonnell2015.SuffixNT.BND")).group prec✝

def ODonnell2015.rNess :

ContextFreeRule Sym SuffixNT

Rule N → A · ness.

Equations

ODonnell2015.rNess = { input := ODonnell2015.SuffixNT.N, output := [Symbol.nonterminal ODonnell2015.SuffixNT.A, Symbol.terminal ODonnell2015.Sym.ness] }

def ODonnell2015.rIon :

ContextFreeRule Sym SuffixNT

Rule N → V · ion.

Equations

ODonnell2015.rIon = { input := ODonnell2015.SuffixNT.N, output := [Symbol.nonterminal ODonnell2015.SuffixNT.V, Symbol.terminal ODonnell2015.Sym.ion] }

def ODonnell2015.rAte :

ContextFreeRule Sym SuffixNT

Rule V → BND · ate. Reflects [OD15]'s -ate:BND>V classification (p. 261): -ate is a verb-forming suffix that selects bound stems, not a noun-forming suffix.

Equations

ODonnell2015.rAte = { input := ODonnell2015.SuffixNT.V, output := [Symbol.nonterminal ODonnell2015.SuffixNT.BND, Symbol.terminal ODonnell2015.Sym.ate] }

def ODonnell2015.rAdj :

ContextFreeRule Sym SuffixNT

Rule A → adj.

Equations

ODonnell2015.rAdj = { input := ODonnell2015.SuffixNT.A, output := [Symbol.terminal ODonnell2015.Sym.adj] }

def ODonnell2015.rV :

ContextFreeRule Sym SuffixNT

Rule V → v.

Equations

ODonnell2015.rV = { input := ODonnell2015.SuffixNT.V, output := [Symbol.terminal ODonnell2015.Sym.v] }

def ODonnell2015.rBnd :

ContextFreeRule Sym SuffixNT

Rule BND → bnd.

Equations

ODonnell2015.rBnd = { input := ODonnell2015.SuffixNT.BND, output := [Symbol.terminal ODonnell2015.Sym.bnd] }

def ODonnell2015.suffixGrammar :

ContextFreeGrammar Sym

The toy CFG: nominalisation via -ness (from adjective) or -ion (from verb), verb formation via -ate (from bound stem).

Equations

One or more equations did not get rendered due to their size.
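
To illustrate how the rules compose, here is a self-contained sketch that replaces the Mathlib types with local stand-ins: Symb for Symbol, Rule for ContextFreeRule, and a hypothetical helper applyRule that rewrites the leftmost matching nonterminal. None of these names come from the source, and the sketch makes no assumption about the grammar's initial symbol. It derives the terminal string bnd ate ion from N, schematically a bound stem taking -ate and then -ion.

```lean
namespace ToyCFGSketch  -- hypothetical stand-ins for the Mathlib-based originals

inductive Sym where
  | adj | v | bnd | ness | ion | ate
  deriving DecidableEq, Repr

inductive NT where
  | N | A | V | BND
  deriving DecidableEq, Repr

/-- Stand-in for `Symbol Sym NT`. -/
inductive Symb where
  | terminal (t : Sym)
  | nonterminal (n : NT)
  deriving DecidableEq, Repr

/-- Stand-in for `ContextFreeRule Sym NT`. -/
structure Rule where
  input  : NT
  output : List Symb
  deriving DecidableEq, Repr

def rIon : Rule := ⟨.N, [.nonterminal .V, .terminal .ion]⟩
def rAte : Rule := ⟨.V, [.nonterminal .BND, .terminal .ate]⟩
def rBnd : Rule := ⟨.BND, [.terminal .bnd]⟩

/-- Rewrite the leftmost occurrence of `r.input` with `r.output`. -/
def applyRule (r : Rule) : List Symb → List Symb
  | [] => []
  | .nonterminal n :: rest =>
      if n = r.input then r.output ++ rest
      else .nonterminal n :: applyRule r rest
  | s :: rest => s :: applyRule r rest

-- N ⇒ V ion ⇒ BND ate ion ⇒ bnd ate ion
example :
    applyRule rBnd (applyRule rAte (applyRule rIon [.nonterminal .N]))
      = [.terminal .bnd, .terminal .ate, .terminal .ion] := by decide

end ToyCFGSketch
```

The derivation shows that rAte feeds rIon through the shared V nonterminal, which is why rAte has V rather than N as its left-hand side.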

@[implicit_reducible]

instance ODonnell2015.instDecidableEqNTSymSuffixGrammar :

DecidableEq suffixGrammar.NT

DecidableEq for the grammar's NT projection — needed by DMPCFG's typeclass arguments. Not synthesised automatically because suffixGrammar.NT is a structure projection that the typeclass solver does not reduce to SuffixNT.

Equations

ODonnell2015.instDecidableEqNTSymSuffixGrammar = ODonnell2015.instDecidableEqNTSymSuffixGrammar._aux_1
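
The synthesis problem described above can be reproduced in miniature. The sketch below uses a hypothetical Grammar structure as a stand-in for ContextFreeGrammar and supplies the instance with inferInstanceAs; this is one common way to bridge such a gap, and is not necessarily how the source defines its instance.

```lean
namespace ProjectionSketch  -- hypothetical names throughout

/-- Stand-in for a grammar structure that bundles its nonterminal type. -/
structure Grammar where
  NT : Type
  initial : NT

inductive SuffixNT where
  | N | A | V | BND
  deriving DecidableEq

def toyGrammar : Grammar := { NT := SuffixNT, initial := SuffixNT.N }

-- Plain `inferInstance` is expected to fail here: instance search does not
-- unfold `toyGrammar` to discover that `toyGrammar.NT` is `SuffixNT`.
-- `inferInstanceAs` states the unfolded type explicitly instead.
instance : DecidableEq toyGrammar.NT :=
  inferInstanceAs (DecidableEq SuffixNT)

-- With the instance registered, equality on the projected type is decidable.
example : toyGrammar.initial = toyGrammar.initial := by decide

end ProjectionSketch
```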

Bridge from data layer + DMPCFG instance #

def ODonnell2015.suffixToRule :

Suffix → ContextFreeRule Sym SuffixNT

Bridge from Suffix to the rules of this grammar.

def ODonnell2015.pseudoVal (r : ContextFreeRule Sym SuffixNT) :

ℝ

Per-rule pseudo-count for the toy grammar. The three productivity-bearing rules get productivityIndex + 1 (so ness ↦ 3, ion ↦ 2, ate ↦ 1), inheriting both the strict ordering and any future revision of Suffix.productivityIndex. The three structural selectional rules get a neutral 1.

Equations

One or more equations did not get rendered due to their size.
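
Since the equations for suffixToRule and pseudoVal are not rendered, the following sketch shows the shape the description suggests. It is a simplification under stated assumptions: rules are represented by a hypothetical six-constructor tag type (RuleTag) rather than by ContextFreeRule values, pseudo-counts are Nat rather than ℝ, and the index values 2, 1, 0 are inferred from the documented results 3, 2, 1.

```lean
namespace PseudoSketch  -- hypothetical namespace, not part of Linglib

inductive Suffix where
  | ness | ion | ate

-- Assumed values, inferred from the documented pseudo-counts.
def Suffix.productivityIndex : Suffix → Nat
  | .ness => 2
  | .ion  => 1
  | .ate  => 0

/-- Hypothetical tags standing in for the six rules of the toy grammar. -/
inductive RuleTag where
  | rNess | rIon | rAte | rAdj | rV | rBnd

def suffixToRule : Suffix → RuleTag
  | .ness => .rNess
  | .ion  => .rIon
  | .ate  => .rAte

/-- Suffix rules get `productivityIndex + 1`; structural rules get 1. -/
def pseudoVal : RuleTag → Nat
  | .rNess => Suffix.ness.productivityIndex + 1
  | .rIon  => Suffix.ion.productivityIndex + 1
  | .rAte  => Suffix.ate.productivityIndex + 1
  | _      => 1

example : (pseudoVal .rNess, pseudoVal .rIon, pseudoVal .rAte) = (3, 2, 1) := rfl

-- Every pseudo-count is positive, as a Dirichlet parameter must be.
theorem pseudoVal_pos : ∀ r, 0 < pseudoVal r := by
  intro r; cases r <;> decide

-- The productivity ordering carries over to the pseudo-counts.
example : ∀ a b : Suffix, a.productivityIndex > b.productivityIndex →
    pseudoVal (suffixToRule a) > pseudoVal (suffixToRule b) := by
  intro a b; cases a <;> cases b <;> decide

end PseudoSketch
```

Adding 1 to the index keeps the pseudo-count of the lowest-ranked suffix strictly positive while preserving the strict order.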

def ODonnell2015.dmpcfgFromObserved :

Morphology.FragmentGrammars.DMPCFG suffixGrammar

A DMPCFG over suffixGrammar whose per-rule pseudo-counts are derived from Suffix.productivityIndex (the qualitative productivity ranking). The connection is structural: revising productivityIndex changes the pseudo-counts here in lockstep.

Equations

ODonnell2015.dmpcfgFromObserved = { pseudo := ODonnell2015.pseudoVal, pseudo_pos := ODonnell2015.dmpcfgFromObserved._proof_1 }

Plumbing: named N-bucket witnesses + parametric pseudoVal lemma #

instance ODonnell2015.n_bucket_nonempty :

Nonempty (suffixGrammar.RulesWithLHS SuffixNT.N)

The N-LHS bucket of suffixGrammar is nonempty (rNess ∈ it). Required for mapWeightPMF and mapWeight_sum_eq_one_of_lhs.

instance ODonnell2015.suffixGrammar_buckets_nonempty (a : suffixGrammar.NT) :

Nonempty (suffixGrammar.RulesWithLHS a)

All four LHS buckets of suffixGrammar are nonempty: every nonterminal in this toy grammar has at least one rule expanding it (N has rNess + rIon, A has rAdj, V has rAte + rV, BND has rBnd).

Required to construct dmpcfgFromObserved.posteriorMAP D as a full MultinomialPCFG suffixGrammar (the structure carries the typeclass [∀ a, Nonempty (G.RulesWithLHS a)] because PMFs over empty supports don't exist).

Theorems #

theorem ODonnell2015.corpusProb_pos_for_empty (M : Morphology.FragmentGrammars.DMPCFG suffixGrammar) :

0 < M.corpusProb 0

The FG-family API exemplified on the toy grammar: any DMPCFG over suffixGrammar assigns probability 1 — and hence positive probability — to the empty corpus. Direct corollary of DMPCFG.corpusProb_zero.

theorem ODonnell2015.dmpcfgFromObserved_pseudo_respects_productivity {a b : Suffix} (h : moreProductiveThan a b) :

dmpcfgFromObserved.pseudo (suffixToRule a) > dmpcfgFromObserved.pseudo (suffixToRule b)

Structural drift sentry: a stronger productivity ranking (moreProductiveThan) implies a larger DMPCFG pseudo-count for the corresponding rule. Propagates moreProductiveThan through pseudoVal, so this breaks if Suffix.productivityIndex is revised in a way that contradicts the rule-level encoding.

theorem ODonnell2015.dmpcfgFromObserved_mapWeightPMF_lt_of_count_gap (D : Multiset (CFGTree Sym SuffixNT)) (h : CFGTree.corpusRuleCount rNess D + 1 < CFGTree.corpusRuleCount rIon D) :

(dmpcfgFromObserved.mapWeightPMF D) ODonnell2015.nNess✝ < (dmpcfgFromObserved.mapWeightPMF D) ODonnell2015.nIon✝

The central failure mode [OD15] Ch 7 documents (p. 268; Fig 7.4 p. 267 supplies the CELEX evidence). DMPCFG posterior MAP weights track pseudo + count, so any corpus where rIon derivations exceed rNess derivations by more than 1 makes DMPCFG's PMF rank rIon above rNess — directly contradicting moreProductiveThan ness ion. The +1 threshold reflects the pseudo-count gap (pseudoVal rNess − pseudoVal rIon = 3 − 2 = 1); once corpus counts overcome the prior gap, frequency dominates.

O'Donnell's CELEX numbers in Fig 7.4 (-ion: ~162k tokens vs -ness: ~16k tokens) put the gap at roughly 147k, several orders of magnitude above the +1 threshold, so the conclusion holds for realistic data; the hypothesis is the abstract minimum that suffices.
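
The arithmetic behind the threshold can be isolated from the PMF machinery. The sketch below works with unnormalised weights (pseudo + count) as described above, in a hypothetical namespace (FlipSketch) with hypothetical names wNess, wIon and ion_overtakes. Both N-rules share the same normalising total, so comparing unnormalised weights gives the same order as comparing PMF masses. The last example treats the CELEX token counts as rule counts, which is an illustrative assumption.

```lean
namespace FlipSketch  -- hypothetical namespace, not part of Linglib

/-- Unnormalised weight of `rNess`: pseudo-count 3 plus corpus count. -/
def wNess (cNess : Nat) : Nat := 3 + cNess

/-- Unnormalised weight of `rIon`: pseudo-count 2 plus corpus count. -/
def wIon (cIon : Nat) : Nat := 2 + cIon

-- Empty corpus: the prior ranks rNess above rIon.
example : wIon 0 < wNess 0 := by decide

-- Once rIon leads by more than the pseudo-count gap of 1, the order flips.
theorem ion_overtakes (cNess cIon : Nat) (h : cNess + 1 < cIon) :
    wNess cNess < wIon cIon := by
  unfold wNess wIon
  omega

-- The threshold is tight: a lead of exactly 1 produces a tie, not a flip.
example (c : Nat) : wNess c = wIon (c + 1) := by
  unfold wNess wIon
  omega

-- Illustrative only: CELEX token counts used as if they were rule counts.
example : wNess 15568 < wIon 162573 :=
  ion_overtakes _ _ (by decide)

end FlipSketch
```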

theorem ODonnell2015.dmpcfgFromObserved_mapWeightPMF_prior_lt :

(dmpcfgFromObserved.mapWeightPMF 0) ODonnell2015.nIon✝ < (dmpcfgFromObserved.mapWeightPMF 0) ODonnell2015.nNess✝

Prior PMF (empty corpus): DMPCFG correctly orders the N-rules of suffixGrammar. With no data, the posterior IS the prior (per mapWeight_zero), and the prior IS the per-LHS-normalised pseudo-counts. Since pseudoVal rNess > pseudoVal rIon by construction, the PMF mass at rNess exceeds that at rIon.

The first half of the [OD15] Ch 7 critique of DMPCFG: it does not start wrong. The model's failure mode is data-driven, not prior-driven.

theorem ODonnell2015.dmpcfgFromObserved_posteriorMAP_prior_lt :

((dmpcfgFromObserved.posteriorMAP 0).rulePMF SuffixNT.N) ODonnell2015.nIon✝ < ((dmpcfgFromObserved.posteriorMAP 0).rulePMF SuffixNT.N) ODonnell2015.nNess✝

Bridge demo. The same prior comparison stated as a fact about dmpcfgFromObserved.posteriorMAP 0 — a MultinomialPCFG suffixGrammar derived from the DMPCFG via the conjugate-prior collapse.

This is the proof-of-life that the DMPCFG → MultinomialPCFG bridge cashes out: any DMPCFG-side PMF fact translates straight to a MultinomialPCFG-side fact about the posterior MAP, via posteriorMAP_rulePMF. Future cross-paper consumers (Albright-Hayes, Bybee, dual-route) can target MultinomialPCFG and have their theorems automatically apply to DMPCFG-derived posteriors.

The full [OD15] Ch 7 critique of DMPCFG, in one theorem. Two facts that look contradictory but aren't:

Without data (empty corpus), DMPCFG's PMF over the N-rules ranks rNess above rIon — matching the data-layer productivityIndex.
Given a corpus with sufficiently many rIon derivations (more than rNess by more than the pseudo-count gap of 1), the PMF flips and ranks rIon above rNess — contradicting the empirical productivity ordering [OD15] reports for English.

Here the DMPCFG is given the right prior by stipulation, but its posterior is based on pseudo + count, so at CELEX-scale token frequencies (Fig 7.4, p. 267) the data overwhelm the prior and the posterior ranking flips. The fix the book proposes, Fragment Grammars, gives a different posterior structure that does not collapse productivity into raw frequency.