
Linglib.Phenomena.Morphology.Studies.ODonnell2015

O'Donnell 2015: English derivational morphology #

@cite{odonnell-2015}

First study file using the FG-family substrate from Theories/Morphology/FragmentGrammars/. Demonstrates the API on the central empirical contrast of @cite{odonnell-2015} Chapter 7 (Fig 7.3, p. 262; Table 7.1, p. 265): the highly productive English nominaliser -ness versus the unproductive -ion and -ate. Data anchor: Phenomena/Morphology/Productivity/FrequencySpectrum.lean.

Empirical content #

The load-bearing claim of the book's Chapter 7 is qualitative: -ness:Adj>N is productive; -ion:V>N and -ate:BND>V are not. In Fig 7.3 (p. 262), only the FG model places -ness in its top-5 productive suffixes; DMPCFG, MAG, DOP1 and ENDOP all wrongly elevate -ion, and three of those also wrongly elevate -ate. The data file FrequencySpectrum.lean encodes a strict ordering ness > ion > ate via Suffix.productivityIndex; the ion > ate half is a tie-break (both are unproductive on novel forms, but -ate is structurally more restricted), not part of @cite{odonnell-2015}'s central contrast.
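
To make the anchor concrete, here is a minimal standalone sketch of what the data layer could look like. The type, the index values, and moreProductiveThan as written are illustrative assumptions, not the contents of FrequencySpectrum.lean.

```lean
-- Hypothetical data-layer sketch: larger index = more productive.
inductive Suffix
  | ness | ion | ate
  deriving DecidableEq, Repr

def Suffix.productivityIndex : Suffix → Nat
  | .ness => 2   -- highly productive Adj>N nominaliser
  | .ion  => 1   -- unproductive; above -ate only as a tie-break
  | .ate  => 0   -- unproductive and structurally more restricted

/-- `s` outranks `t` in the qualitative productivity ordering. -/
abbrev Suffix.moreProductiveThan (s t : Suffix) : Prop :=
  s.productivityIndex > t.productivityIndex

-- The strict ordering ness > ion > ate, checked by computation:
example : Suffix.ness.moreProductiveThan .ion := by decide
example : Suffix.ion.moreProductiveThan .ate := by decide
```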

Note that -ate is not a nominaliser: it is a verb-forming suffix that selects bound stems (e.g. segregate from bound segregat-). The toy grammar below reflects this: rAte produces V, not N, with a BND (bound-stem) nonterminal as its argument. The three suffixes are grouped here because they form the central derivational contrast of @cite{odonnell-2015} Ch 7, not because they share an output category.

DMPCFG critique (Ch 7) #

The DMPCFG model bases its productivity inferences on the token frequency of suffixes (@cite{odonnell-2015} Ch 7, p. 268). Per @cite{odonnell-2015} Fig 7.4 (p. 267), -ion has roughly an order of magnitude more CELEX tokens than -ness, so a learned DMPCFG posterior places -ion above -ness in productivity — exactly the failure mode @cite{odonnell-2015} uses to discriminate FG from DMPCFG. The pseudo-counts in dmpcfgFromObserved are stipulated to track the empirical productivity (via productivityIndex), not learned from a corpus. Two PMF-form theorems below (…_prior_lt and …_lt_of_count_gap) make the prior + flip dichotomy Lean-checkable.

References #

O'Donnell, Timothy J. (2015). Productivity and Reuse in Language: A Theory of Linguistic Computation and Storage. Cambridge, MA: MIT Press.

Toy CFG #

The six terminal symbols of the toy derivational grammar: sentinels adj/v/bnd for adjective, verb and bound-stem bases, plus the three derivational suffixes -ness, -ion, -ate.
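
A plausible standalone rendering of that terminal alphabet; the constructor names follow the prose, but the actual declaration in the file may differ.

```lean
-- Six terminals: three base sentinels and three derivational suffixes.
inductive SuffixTerm
  | adj   -- adjective base sentinel
  | v     -- verb base sentinel
  | bnd   -- bound-stem sentinel (cf. segregat-)
  | ness  -- Adj>N nominaliser
  | ion   -- V>N nominaliser
  | ate   -- BND>V verbaliser
  deriving DecidableEq, Repr
```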


The four nonterminals of the toy derivational grammar. BND represents a bound stem, the selectional restriction of -ate (cf. segregat-, demonstrat-).
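
A matching sketch of the nonterminal type (again assumed, not quoted from the file):

```lean
-- Four nonterminals; BND is the bound-stem category selected by -ate.
inductive SuffixNT
  | N | A | V | BND
  deriving DecidableEq, Repr
```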


Rule N → A · ness.


Rule N → V · ion.


Rule V → BND · ate. Reflects @cite{odonnell-2015}'s -ate:BND>V classification (p. 261): -ate is a verb-forming suffix that selects bound stems, not a noun-forming suffix.


Rule BND → bnd.


The toy CFG: nominalisation via -ness (from adjective) or -ion (from verb), verb formation via -ate (from bound stem).
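
As a sketch, the six rules can be assembled into a grammar value as follows, reusing SuffixTerm and SuffixNT from the sketches above. The Rule/CFG encoding (an LHS nonterminal plus a list of symbols) is an assumption, not the library's actual representation.

```lean
-- A symbol is either a nonterminal or a terminal.
inductive Sym
  | nt (a : SuffixNT)
  | tm (w : SuffixTerm)

structure Rule where
  lhs : SuffixNT
  rhs : List Sym

structure CFG where
  NT    : Type
  start : NT
  rules : List Rule

def rNess : Rule := ⟨.N, [.nt .A, .tm .ness]⟩    -- N → A · ness
def rIon  : Rule := ⟨.N, [.nt .V, .tm .ion]⟩     -- N → V · ion
def rAte  : Rule := ⟨.V, [.nt .BND, .tm .ate]⟩   -- V → BND · ate
def rAdj  : Rule := ⟨.A, [.tm .adj]⟩             -- A → adj
def rV    : Rule := ⟨.V, [.tm .v]⟩               -- V → v
def rBnd  : Rule := ⟨.BND, [.tm .bnd]⟩           -- BND → bnd

def suffixGrammar : CFG :=
  { NT := SuffixNT, start := SuffixNT.N,
    rules := [rNess, rIon, rAte, rAdj, rV, rBnd] }
```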


DecidableEq for the grammar's NT projection, needed by DMPCFG's typeclass arguments. Not synthesised automatically because suffixGrammar.NT is a structure projection that the typeclass solver does not reduce to SuffixNT.
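
The phenomenon is easy to reproduce standalone; this mini-example is hypothetical and only mirrors the shape of the real instance.

```lean
inductive NT'
  | N | A
  deriving DecidableEq

structure CFG' where
  NT : Type

def g : CFG' := ⟨NT'⟩

-- `g.NT` is definitionally `NT'`, but typeclass search will not unfold the
-- projection on its own, so the instance is supplied by hand:
instance : DecidableEq g.NT :=
  show DecidableEq NT' from inferInstance
```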


Bridge from data layer + DMPCFG instance #

Per-rule pseudo-count for the toy grammar. The three productivity-bearing rules get productivityIndex + 1 (so ness ↦ 3, ion ↦ 2, ate ↦ 1), inheriting both the strict ordering and any future revision of Suffix.productivityIndex. The three structural selectional rules get a neutral 1.
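
In sketch form, with the rules as a bare enumeration (hypothetical encoding; the real definition routes through Suffix.productivityIndex, which is how revisions propagate):

```lean
inductive SuffixRule
  | rNess | rIon | rAte | rAdj | rV | rBnd
  deriving DecidableEq, Repr

/-- Productivity-bearing rules get `productivityIndex + 1`
(ness ↦ 2 + 1, ion ↦ 1 + 1, ate ↦ 0 + 1); the rest get a neutral 1. -/
def pseudoVal : SuffixRule → Nat
  | .rNess => 3
  | .rIon  => 2
  | .rAte  => 1
  | _      => 1   -- rAdj, rV, rBnd: structural selectional rules

-- The N-bucket gap that drives the +1 threshold discussed below:
example : pseudoVal .rNess - pseudoVal .rIon = 1 := by decide
```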


A DMPCFG over suffixGrammar whose per-rule pseudo-counts are derived from Suffix.productivityIndex (the data layer's qualitative productivity ranking). The connection is structural: if the data file revises productivityIndex, the pseudo-counts here change in lockstep.
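
A stand-in for that construction, reusing SuffixRule and pseudoVal from the sketch above; the library's DMPCFG structure carries far more than a pseudo-count field.

```lean
-- Hypothetical minimal DMPCFG: only the per-rule pseudo-counts matter here.
structure DMPCFGSketch (R : Type) where
  pseudo : R → Nat

-- Deriving the counts from `pseudoVal` (and so, in the real file, from
-- `Suffix.productivityIndex`) keeps data layer and model in lockstep.
def dmpcfgFromObserved : DMPCFGSketch SuffixRule := ⟨pseudoVal⟩
```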


Plumbing: named N-bucket witnesses + parametric pseudoVal lemma #

The N-LHS bucket of suffixGrammar is nonempty (witnessed by rNess). Required for mapWeightPMF and mapWeight_sum_eq_one_of_lhs.

All four LHS buckets of suffixGrammar are nonempty: every nonterminal in this toy grammar has at least one rule expanding it (N has rNess and rIon, A has rAdj, V has rAte and rV, BND has rBnd).

Required to construct dmpcfgFromObserved.posteriorMAP D as a full MultinomialPCFG suffixGrammar (the structure carries the typeclass [∀ a, Nonempty (G.RulesWithLHS a)] because PMFs over empty supports don't exist).
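
A standalone sketch of the bucket structure and its four witnesses, reusing SuffixNT and SuffixRule from the earlier sketches; the subtype encoding of RulesWithLHS is an assumption.

```lean
/-- The LHS nonterminal of each rule (assumed bucket assignment). -/
def ruleLhs : SuffixRule → SuffixNT
  | .rNess => .N
  | .rIon  => .N
  | .rAdj  => .A
  | .rAte  => .V
  | .rV    => .V
  | .rBnd  => .BND

/-- Rules expanding `a`, encoded as a subtype. -/
def RulesWithLHS (a : SuffixNT) := { r : SuffixRule // ruleLhs r = a }

-- Every bucket is inhabited, so per-LHS PMFs can exist:
instance : ∀ a, Nonempty (RulesWithLHS a) := fun a =>
  match a with
  | .N   => ⟨⟨.rNess, rfl⟩⟩
  | .A   => ⟨⟨.rAdj,  rfl⟩⟩
  | .V   => ⟨⟨.rAte,  rfl⟩⟩
  | .BND => ⟨⟨.rBnd,  rfl⟩⟩
```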

Theorems #

The FG-family API exemplified on the toy grammar: any DMPCFG over suffixGrammar assigns probability 1 (and hence positive probability) to the empty corpus. Direct corollary of DMPCFG.corpusProb_zero.
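
The mathematical content is the empty-product convention; a self-contained illustration:

```lean
-- A corpus probability is a product over its derivations; the empty corpus
-- gives the empty product, which is 1 and in particular positive.
example : ([] : List Nat).foldl (· * ·) 1 = 1 := rfl
example : 0 < ([] : List Nat).foldl (· * ·) 1 := by decide
```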

Structural drift sentry: a stronger productivity ranking in the data layer (Phenomena/Morphology/Productivity/FrequencySpectrum) implies a larger DMPCFG pseudo-count for the corresponding rule. Propagates moreProductiveThan through pseudoVal, so this breaks if Suffix.productivityIndex is revised in a way that contradicts the rule-level encoding.
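
In the sketch vocabulary above, the sentry reduces to computable inequalities over pseudoVal; the statement shape below is hypothetical (the library's lemma is parametric rather than a brute-force check).

```lean
/-- If the data layer ranks ness > ion > ate, the pseudo-counts must track
it; a contradicting revision of the index would fail to compile here. -/
theorem pseudoVal_tracks_ranking :
    pseudoVal .rNess > pseudoVal .rIon ∧ pseudoVal .rIon > pseudoVal .rAte := by
  decide
```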

The central failure mode @cite{odonnell-2015} Ch 7 documents (p. 268; Fig 7.4, p. 267 supplies the CELEX evidence). DMPCFG posterior MAP weights track pseudo + count, so any corpus where rIon derivations exceed rNess derivations by more than 1 makes DMPCFG's PMF rank rIon above rNess, directly contradicting moreProductiveThan ness ion. The +1 threshold reflects the pseudo-count gap (pseudoVal rNess − pseudoVal rIon = 3 − 2 = 1); once corpus counts overcome the prior gap, frequency dominates.

O'Donnell's CELEX numbers in Fig 7.4 (-ion: ~162k tokens vs -ness: ~16k tokens) put the count gap several orders of magnitude above the +1 threshold, so the conclusion holds for realistic data; the hypothesis is the abstract minimum that suffices.
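
The threshold arithmetic in isolation, with 3 and 2 the sketch pseudo-counts and the corpus counts left as free variables:

```lean
-- A count gap strictly greater than 1 flips the posterior numerators
-- pseudo + count for the two N-rules:
example (cNess cIon : Nat) (h : cIon > cNess + 1) :
    2 + cIon > 3 + cNess := by omega
```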

Prior PMF (empty corpus): DMPCFG correctly orders the N-rules of suffixGrammar. With no data, the posterior is exactly the prior (per mapWeight_zero), and the prior is exactly the per-LHS-normalised pseudo-counts. Since pseudoVal rNess > pseudoVal rIon by construction, the PMF mass at rNess exceeds that at rIon.

The first half of the @cite{odonnell-2015} Ch 7 critique of DMPCFG: it does not start wrong. The model's failure mode is data-driven, not prior-driven.
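
A numeric sketch of this prior half, with an assumed MAP weight over the two-rule N-bucket (normalised pseudo + count); nBucket and mapWeightN are hypothetical names built on the earlier sketches.

```lean
def nBucket : List SuffixRule := [.rNess, .rIon]

/-- MAP weight of an N-rule: (pseudo + count), normalised over the bucket. -/
def mapWeightN (count : SuffixRule → Nat) (r : SuffixRule) : Float :=
  Float.ofNat (pseudoVal r + count r) /
    Float.ofNat (nBucket.foldl (fun s q => s + pseudoVal q + count q) 0)

-- Empty corpus: the prior is just the normalised pseudo-counts.
#eval mapWeightN (fun _ => 0) .rNess  -- 3/5 = 0.6
#eval mapWeightN (fun _ => 0) .rIon   -- 2/5 = 0.4
```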

Bridge demo. The same prior comparison stated as a fact about dmpcfgFromObserved.posteriorMAP 0, a MultinomialPCFG suffixGrammar derived from the DMPCFG via the conjugate-prior collapse.

This is the proof-of-life that the DMPCFG → MultinomialPCFG bridge cashes out: any DMPCFG-side PMF fact translates straight to a MultinomialPCFG-side fact about the posterior MAP, via posteriorMAP_rulePMF. Future cross-paper consumers (Albright-Hayes, Bybee, dual-route) can target MultinomialPCFG and have their theorems automatically apply to DMPCFG-derived posteriors.

The full @cite{odonnell-2015} Ch 7 critique of DMPCFG, in one theorem. Two facts that look contradictory but aren't:

• Without data (empty corpus), DMPCFG's PMF over the N-rules ranks rNess above rIon, matching the data-layer productivityIndex.
• Given a corpus with sufficiently many rIon derivations (exceeding the rNess count by more than the pseudo-count gap of 1), the PMF flips and ranks rIon above rNess, contradicting the empirical productivity ordering @cite{odonnell-2015} reports for English.

Per Ch 7 (Fig 7.4, p. 267), DMPCFG is built with the right prior but bases its posterior on pseudo + count, so when CELEX-scale token frequencies hit the model, the data overwhelms the prior and the posterior ranking flips. The fix the book proposes, Fragment Grammars, gives a different posterior structure that does not collapse productivity into raw frequency.
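
The flip half of the dichotomy in the same sketch: a corpus skewed towards rIon overturns the prior ordering. The counts are made-up small numbers well past the +1 threshold, not the Fig 7.4 values.

```lean
-- Ten rIon derivations and none for rNess.
def ionHeavy : SuffixRule → Nat
  | .rIon => 10
  | _     => 0

#eval mapWeightN ionHeavy .rNess  -- 3/15 = 0.2: prior mass swamped
#eval mapWeightN ionHeavy .rIon   -- 12/15 = 0.8: frequency wins, contra the data layer
```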