
Linglib.Phenomena.Morphology.Studies.RathiHahnFutrell2026

@cite{rathi-hahn-futrell-2026}: Information-theoretic morphological fusion #

@cite{rathi-hahn-futrell-2026} @cite{rathi-hahn-futrell-2021} @cite{hahn-degen-futrell-2021} @cite{hahn-mathew-degen-2021}

@cite{rathi-hahn-futrell-2026} extend the Memory-Surprisal Tradeoff (@cite{hahn-degen-futrell-2021}, formalized in Theories/Processing/MemorySurprisal/Basic.lean and consumed in Phenomena/Morphology/Studies/HahnDegenFutrell2021.lean) from word/morpheme order to four further morphological phenomena:

  1. Polyexponence — features fused into a single morpheme.
  2. Suppletion — root forms changing unpredictably with grammatical features.
  3. Category clustering — mutually exclusive feature values in consistent slots.
  4. Pairwise informational fusion — gradable measure of how cell-pair forms resist decomposition into per-feature morphemes.

The central theoretical move: define informational fusion as the surprisal of a form given all-but-one feature held out from a learner's training data (eq. 4 of the paper). This generalizes binary polyexponence to a continuous quantity that handles nonconcatenative morphology (Semitic root-and-pattern, ablaut, suppletion, reduplication) without committing to a morpheme segmentation.
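A schematic rendering of that definition (a paraphrase of the paper's eq. 4, not a verified transcription; the notation p̂₋f for the held-out-feature learner is introduced here for illustration only):

```latex
% Schematic only: \hat{p}_{-f} is the learner's predictive distribution
% after training with feature f's value withheld; y is the attested
% surface form for the paradigm cell x.
\varphi(f) \;=\; -\log \hat{p}_{-f}\!\left(y \mid x\right)
```

Because the quantity is a surprisal over whole surface forms, no morpheme segmentation is presupposed.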

Structure #

Numerics that are NOT Lean-internal #

The paper's empirical results are computed by an external LSTM seq2seq learner on UniMorph 3.0 paradigm data (@cite{mccarthy-kirov-2020}, @cite{sylak-glassman-2015}). Specific φ₂ values, Pareto AUC values, and permutation-test p-values cannot be re-derived inside Lean; the LSTM is the authoritative source. All such values below are flagged `-- UNVERIFIED:` per CLAUDE.md hallucination-prevention rules; they are recorded here for cross-study reference and human verification against the published figures.

Substantive concerns flagged #

The number-vs-case suppletion result (§4 below, 15/17 languages confirm the ETH prediction) glosses over a known counterexample: Russian čelovek/ljudi ('person') coexists with case-and-number suppletion god/let ('year'), where the GEN.PL form is suppletive in both number AND case. @cite{moskal-2015} advances the asymmetry as a structural universal under a phase-based locality account; @cite{rathi-hahn-futrell-2026} treat it as a frequency-based regularity averaged over the typological sample. These are different claims about the same data; this study file follows the paper but flags the distinction.

Tables 4 (L_agg vs L_fus) and 6 (L_clustered vs L_nonclustered) #

Each toy "language" maps two binary input features (X₁ ∈ {ACTIVE, PASSIVE}, X₂ ∈ {PRESENT, PAST}) to a 2-character string Y₁Y₂ ∈ {A,B,C,D}². The four (X₁, X₂) combinations have frequencies 3/8, 1/8, 1/8, 3/8 — chosen so that mutual information I[X₁; X₂] > 0 while marginals are uniform.

We encode each (X₁, X₂) combination as a separate InflectionClass 2 (a two-cell paradigm: cell 0 is the first character, cell 1 is the second). Frequencies attach to inflection classes, recovering the table-4 weights.
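A minimal sketch of that encoding, under assumed simplified shapes (the real `InflectionClass` lives in the substrate and differs; field names here are illustrative):

```lean
-- Hypothetical simplification of the study file's encoding: a two-cell
-- paradigm row plus its table-4 frequency. Names and field shapes are
-- illustrative, not copied from the actual source.
structure ToyClass where
  cell0 : String   -- first character of the form
  cell1 : String   -- second character of the form
  freq  : ℚ        -- table-4 weight

-- The four (X₁, X₂) combinations of L_agg:
def L_agg : List ToyClass :=
  [ ⟨"A", "C", 3/8⟩    -- ACTIVE,  PRESENT
  , ⟨"A", "D", 1/8⟩    -- ACTIVE,  PAST
  , ⟨"B", "C", 1/8⟩    -- PASSIVE, PRESENT
  , ⟨"B", "D", 3/8⟩ ]  -- PASSIVE, PAST
```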

Paper Table 4:

Probability  (X₁, X₂)              L_agg form  L_fus form
3/8          ACTIVE, PRESENT       AC          AC
1/8          ACTIVE, PAST          AD          AD
1/8          PASSIVE, PRESENT      BC          BD
3/8          PASSIVE, PAST         BD          BC

L_agg: each character independently encodes one feature (Y₁=A iff X₁=ACTIVE, Y₂=C iff X₂=PRESENT). L_fus: Y₂ is the XOR of X₁ and X₂ (Controlled-NOT).

L_agg, the agglutinative toy language. n=2 cells (first character, second character); 4 inflection classes (one per (X₁, X₂) combination) with frequencies 3/8, 1/8, 1/8, 3/8.


L_fus, the fusional toy language. Same paradigm shape as L_agg but the second character of each form depends on BOTH input features (XOR).


E-complexity (@cite{ackerman-malouf-2013}): both toy languages have 4 inflection classes.

The cell-1 distribution of L_agg has support of exactly two forms ("C", "D"), which means Y₂ takes both values, supporting H[Y₂_agg] = log 2 (paper Appendix A1 step 1). The numerical entropy comparison itself goes through PMF.conditionalEntropy_le_entropy; the structural witness is here. The actual list (with kernel-evaluated rationals) is [("C", 1/2), ("D", 1/2)]: the marginal is uniform.

The cell-1 distribution of L_fus also has support of two forms, but with concentrated weights [("C", 3/4), ("D", 1/4)] (kernel-evaluated). The concentration is the structural condition for H[Y₂_fus] < log 2.

Both toy languages have the same number of cell-0 distinct realizations (two: "A" and "B"). Fusion in L_fus does not affect cell 0.
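The two cell-1 marginals give the entropy gap by direct arithmetic (base-2 logs; a worked check computed here, not taken from the paper):

```latex
H[Y_2^{\text{agg}}] = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2}
  = 1\ \text{bit} = \log 2 ,\\
H[Y_2^{\text{fus}}] = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4}
  \approx 0.811\ \text{bits} < \log 2 .
```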

Table 6: L_clustered vs L_nonclustered (paper §3.3) #

Two binary input features X₁, X₂; same 4 forms in the 4-element string set {A,B,C,D}. L_clustered places voice (X₁) before tense (X₂) consistently; L_nonclustered alternates the order based on voice value.

L_clustered: voice always realized in slot 0, tense always in slot 1.


L_nonclustered: voice morpheme in slot 0 for ACTIVE; in slot 1 for PASSIVE. The same two morpheme inventories occupy different slots depending on the value of X₁.


Paper Appendix A1: fusion can lower local surprisal #

The abstract theorem fusion_can_lower_marginal_entropy lives in Theories/Processing/MemorySurprisal/InformationalFusion.lean and is a direct re-export of PMF.conditionalEntropy_le_entropy (Cover-Thomas 2.6.4).

The instantiation here exhibits L_fus and L_agg as concrete witnesses: both have the same first-character distribution (L_agg_L_fus_cell0_match), but L_fus's second-character distribution is strictly more concentrated (6/8 C vs 4/8 C; see L_fus_cell1_distribution), so its marginal Shannon entropy is strictly lower. The conclusion follows from conditionalEntropy_le_entropy applied to the joint (X₁, X₂)-distribution implicit in the toy paradigm construction. The numerical strict inequality H[Y₂_fus] < H[Y₂_agg] requires Real.log arithmetic and lives in the abstract substrate; the structural witnesses (the distributions) live here.

L_fus's cell-1 distribution has 6/8 mass on a single form (C), making it strictly more concentrated than L_agg's uniform 4/8-4/8 split. Direct from the explicit lists in L_fus_cell1_distribution and L_agg_cell1_distribution. The substrate's fusion_can_lower_marginal_entropy (Cover-Thomas 2.6.4) applies to the underlying joint distribution.

Paper Appendix A2: fusion of independent features increases remote uncertainty #

The abstract theorem agglutination_lowers_remote_uncertainty lives in InformationalFusion.lean and exposes mutualInformation_nonneg (Cover-Thomas 2.6.5) as the building block. The paper's Table 5 gives a 3-feature toy with I[X₁; X₂] = 0 and I[X₂; X₃] > 0; we omit the 3-feature instantiation here, as it requires a ParadigmSystem 3 with 8 inflection classes. The structural pattern is identical to A1, with the substrate theorem doing the work.

Paper Appendix A3: category clustering lowers local surprisal #

The structural witness: in L_clustered, knowing the slot-0 morpheme uniquely predicts which feature it expresses (always voice). In L_nonclustered, the same morpheme inventory {A,B,D} appears at slot 0 across multiple X₁ values. The slot-0 distribution of L_nonclustered has 3 distinct values (A, B, D) versus L_clustered's 2 (A, B); higher cardinality means potentially higher entropy.

L_clustered's slot-0 distribution has support ≤ 2 forms (voice is always realized in slot 0; only A and B appear).

L_nonclustered's slot-0 distribution has support 3 forms (A, B, D: the voice morpheme has been "displaced" into slot 0 for some entries). Higher cardinality is the structural condition for higher entropy (paper §3.3).
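The cardinality argument amounts to the standard support bound H[Y] ≤ log |supp(Y)| (a worked note; cardinality alone gives only an upper bound, so the strict comparison still needs the actual slot-0 weights):

```latex
H[Y_0^{\text{clustered}}] \le \log_2 2 = 1\ \text{bit}, \qquad
H[Y_0^{\text{nonclustered}}] \le \log_2 3 \approx 1.585\ \text{bits}.
```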

Polyexponence — features that cluster in optimal ordering #

The paper computes optimal feature orderings for ~20 languages by minimizing the memory-surprisal AUC and asks: do features that are commonly polyexponent (person/number, case/number, TAM, PNG) cluster close together in the resulting orderings?

Result (paper §4.2): yes. For each polyexponent feature pair/triple, the average normalized separation in optimal orderings is significantly lower than the random baseline. All four feature combinations show p < 0.001 by one-sample t-test.

Specific normalized-separation values from paper Figs 7-8 are NOT quantified in the paper prose; only mean diamonds and 95% CIs are visible in the violin plots. We record below the qualitative result with a sample-size table.

A polyexponent-feature group from paper §4.2: which Bybee categories participate, the language sample size, and the significance verdict.

categories carries the typed Bybee categorization rather than a display string, so cross-framework cross-checks (e.g., polyexponent_categories_in_core_inflectional_range below) can engage the substrate's MorphCategory.peripherality directly. Display names (PNG, TAM, person/number, case/number) are recorded inline at the constructor sites.

• categories : List Bybee1985.BybeeCategory

  The Bybee categories participating in this polyexponent group. Some paper-described features (notably case) have no Bybee primitive because Bybee 1985 surveys only verbal morphology; case is omitted from the typed list and noted in the constructor comment.

• numLanguages :

  Number of UD treebank languages contributing data points.

• significantlyClustered : Bool

  Whether the average separation is significantly below random (p < 0.001).
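A hypothetical reconstruction of the structure from the field list above (the elided type of numLanguages is assumed to be Nat; the actual definition in the study file may differ):

```lean
-- Hypothetical reconstruction; field names follow the rendered docs,
-- but `Nat` for `numLanguages` is an assumption (its type is elided above).
structure PolyexponentGroup where
  categories : List Bybee1985.BybeeCategory
  numLanguages : Nat
  significantlyClustered : Bool
```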


All four feature groups in the paper's sample show significant clustering.

Suppletion as fusion of root with feature #

The paper frames suppletion as fusion of the root with the suppletion-triggering grammatical feature. @cite{veselinova-2013} (WALS Ch 79) and @cite{moskal-2015} document that nominal suppletion is more commonly triggered by number than by case. The paper's prediction: in the memory-surprisal optimal ordering of features, number should be closer to the root than case across languages.

Result (paper §4.3, p. 18): 15 of 17 languages confirm number-before-case in optimal ordering. Two exceptions: Russian and Urdu.

The two exceptions are non-trivial. Russian čelovek/ljudi coexists with the case-and-number suppletion god/let (GEN.PL). @cite{moskal-2015} treats nominal-suppletion locality as a structural universal with a phase-based account; the paper treats it as a tendency derivable from average-case frequency reasoning. These are different theoretical claims about the same data.

Per-language suppletion-relevant ordering data from paper §4.3. The boolean records whether number is closer to the root than case in the language's optimal ordering.

• name : String
• family : String
• numberBeforeCase : Bool

  True iff number-before-case holds in the memory-surprisal optimal ordering (the prediction the paper claims to verify).
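A hypothetical reconstruction of the per-language record, with the two documented exceptions as example entries (structure and definition names are illustrative, not copied from the source):

```lean
-- Illustrative only: shape inferred from the field list above.
structure SuppletionDatum where
  name : String
  family : String
  numberBeforeCase : Bool

-- The two languages where the number-before-case prediction fails:
def russian : SuppletionDatum := ⟨"Russian", "Indo-European", false⟩
def urdu    : SuppletionDatum := ⟨"Urdu", "Indo-European", false⟩
```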


Russian and Urdu are the only languages in the sample where the paper's number-before-case prediction fails. This is the substantive structural claim; it survives sample growth as long as Russian and Urdu remain the only counterexamples. The "15 of 17" docstring count is by construction; per linglib's feedback_no_aggregate_count_theorems.md, aggregate-count theorems go stale on data revision and are omitted here.

Spanish amar paradigm (paper Table 7) #

The paper estimates pairwise informational fusion φ₂(f₁, f₂) for Spanish verbal feature pairs using an LSTM seq2seq learner trained on UniMorph 3.0. Two specific values are cited in the prose.

The values are produced by the LSTM, not derivable inside Lean. We record them as documented constants with `-- UNVERIFIED:` flags.

The qualitative claim (feature pairs with high φ₂ are CLOSE in optimal ordering) is verified empirically by the paper's Pareto-frontier permutation test (Fig 10), p < 0.05 for all four languages tested (Arabic, Latin, Portuguese, Spanish).

φ₂(IMPF, SG) for Spanish amar ≈ 2.71 bits: low, since the morphology is regular (ama + ba + s decomposes cleanly). Stored × 100.


φ₂(1, PL) for Spanish amar ≈ 46.08 bits: very high; the -mos suffix cannot be predicted from any subset of (1st-person, plural) features when both are held out from the learner. Stored × 100.


The (1, PL) feature pair has substantially higher informational fusion than (IMPF, SG), matching the paper's qualitative observation that the -mos suffix cannot be predicted from the rest of the paradigm.
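A sketch of how the ×100 storage keeps the comparison kernel-decidable (constant and theorem names are illustrative; the values are the paper's UNVERIFIED LSTM estimates):

```lean
-- UNVERIFIED: LSTM-derived values from paper Table 7, stored ×100 as
-- naturals so the ordering is decidable without Real.log arithmetic.
def phi2_impf_sg_x100  : Nat := 271   -- φ₂(IMPF, SG) ≈ 2.71 bits
def phi2_first_pl_x100 : Nat := 4608  -- φ₂(1, PL)   ≈ 46.08 bits

theorem phi2_first_pl_exceeds_impf_sg :
    phi2_impf_sg_x100 < phi2_first_pl_x100 := by decide
```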

Bybee Relevance ↔ informational fusion #

The paper §5.2 claims to operationalize Bybee 1985's "relevance" as mutual information: features more relevant to the stem cluster close, and fuse, due to high mutual information.

This is a substantive reframing, not a translation. Bybee's "relevance" is defined in terms of semantic effect on stem denotation (valence > voice > aspect > tense > mood > agreement; @cite{bybee-1985} Ch 2 §3 p. 20). Mutual information is a usage statistic. These coincide to the extent that semantic relevance drives co-occurrence regularity, but they are distinct constructs and can in principle diverge.

The substrate's Core.Morphology.MorphCategory.peripherality numerically encodes Bybee's hierarchy as constants (MorphRule.lean:264-276). The bridge Bybee1985.toMorphCategory : BybeeCategory → MorphCategory (Bybee1985.lean:248-257) connects the paper-typed enum to the substrate. @cite{rathi-hahn-futrell-2026}'s reframing makes those constants potentially derivable from MI on a large multilingual corpus.

The substrate retains the Bybee primitive (option (a) in the cross-framework audit); the cross-check theorem below tracks polyexponentGroups data through toMorphCategory ∘ peripherality, so editing the data table drives the theorem (strong test: adding a feature group containing nonfinite (rank 9) breaks the theorem).

Polyexponent feature groups stay within the core inflectional band: every category appearing in polyexponentGroups has substrate peripherality ≤ 8 (the agreement rank). None involves the linglib-extension categories nonfinite (rank 9), the lexical-derivational derivation (rank 1), or the adjectival degree (rank 5).

The theorem engages the substrate primitives directly via toMorphCategory ∘ peripherality, so:

• Strong test passes: adding a category with peripherality > 8 to any polyexponentGroups entry breaks compilation.
• No silent disagreement: the rank values come from MorphCategory.peripherality, not a duplicate local table.

Note: in the substrate, all four agreement subtypes (personAgr/numberAgr/genderAgr/personAgrObj) collapse to MorphCategory.agreement _ at rank 8. This is faithful to Bybee (Ch 2 §3 collapses agreement subtypes; only object-person is reported separately, and number-as-agreement vs nominal number is a separate distinction the substrate flags).
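A hedged sketch of the cross-check theorem described above; polyexponentGroups, Bybee1985.toMorphCategory, and MorphCategory.peripherality are named on this page, but their exact signatures are assumed here:

```lean
-- Sketch: every category in every polyexponent group maps to a substrate
-- peripherality rank at or below agreement (rank 8). If the data table
-- gains a `nonfinite` (rank 9) entry, `decide` fails and the build breaks.
theorem polyexponent_categories_in_core_inflectional_range :
    ∀ g ∈ polyexponentGroups, ∀ c ∈ g.categories,
      (Bybee1985.toMorphCategory c).peripherality ≤ 8 := by decide
```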

Mirror Principle ↔ ETH #

@cite{baker-1985}'s Mirror Principle holds that morpheme order reflects syntactic-derivation order. @cite{rathi-hahn-futrell-2026} §5.1 argues that ETH makes compatible predictions for affix order, but for a different reason: ETH derives ordering from on-line processing efficiency rather than from underlying syntactic structure.

A formal bridge mirror_compatible_with_information_locality would need Phenomena/Morphology/Studies/Baker1985.lean to import the memory-surprisal substrate. This is out of scope for the current study file; the existing Baker 1985 study lives at Phenomena/Morphology/Studies/Baker1985.lean and should eventually carry such a bridge.

DM Fusion ↔ informational fusion #

@cite{halle-marantz-1993} treat fusion as postsyntactic merger of adjacent terminals (Theories/Morphology/DM/PostSyntacticOps.lean::FusionRule). @cite{rathi-hahn-futrell-2026}'s informational fusion is a usage statistic on surface paradigm forms. The two are different mathematical objects with overlapping names; a formal bridge dm_fusion_implies_high_mi is out of scope but would tie the substrate together.

Ackerman-Malouf 2013 LCEC ↔ informational fusion #

@cite{ackerman-malouf-2013}'s i-complexity is (1/n(n-1)) Σᵢ≠ⱼ H(Cᵢ|Cⱼ), the average pairwise conditional entropy across cells of the same paradigm system. @cite{rathi-hahn-futrell-2026}'s pairwise informational fusion φ₂(f₁,f₂) is closely related: both measure paradigm-internal predictability. With Phase 3 of the substrate restructure, both consumers share Core.Morphology.ParadigmSystem and route through PMF.conditionalEntropy. A formal bridge iComplexity ↔ MutualInfoProfile.totalInfo is left for future work; the shared substrate makes such a bridge syntactically straightforward.
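Rendered as a display formula, the Ackerman-Malouf quantity exactly as given in the prose above:

```latex
\text{i-complexity} \;=\; \frac{1}{n(n-1)} \sum_{i \neq j} H\!\left(C_i \mid C_j\right)
```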