@cite{rathi-hahn-futrell-2026}: Information-theoretic morphological fusion #
@cite{rathi-hahn-futrell-2026} extend the Memory-Surprisal Tradeoff
(@cite{hahn-degen-futrell-2021}, formalized in
Theories/Processing/MemorySurprisal/Basic.lean and consumed in
Phenomena/Morphology/Studies/HahnDegenFutrell2021.lean) from word/morpheme
order to four further morphological phenomena:
- Polyexponence — features fused into a single morpheme.
- Suppletion — root forms changing unpredictably with grammatical features.
- Category clustering — mutually exclusive feature values in consistent slots.
- Pairwise informational fusion — gradable measure of how cell-pair forms resist decomposition into per-feature morphemes.
The central theoretical move: define informational fusion as the surprisal of a form given all-but-one feature held out from a learner's training data (eq. 4 of the paper). This generalizes binary polyexponence to a continuous quantity that handles nonconcatenative morphology (Semitic root-and-pattern, ablaut, suppletion, reduplication) without committing to a morpheme segmentation.
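One schematic way to write eq. 4 (our notation, a sketch only; the paper's exact conditioning set may differ):

$$\varphi(f) \;=\; -\log_2 P_{\hat{L}_{-f}}\!\left(y_c \mid c\right),$$

where $\hat{L}_{-f}$ is the learner trained with feature $f$'s forms held out, $c$ is a paradigm cell realizing $f$, and $y_c$ is its surface form. High $\varphi(f)$ means the form is not recoverable from per-feature regularities, i.e., the feature is fused into the form.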
Structure #
- §1: Toy languages L_agg, L_fus, L_clustered (paper Tables 4, 5, 6)
- §2: A1/A2/A3 instantiations (paper Appendix); abstract theorems live in
  Theories/Processing/MemorySurprisal/InformationalFusion.lean
- §3: Polyexponence empirical clustering (paper §4.2, Figs 7–8)
- §4: Suppletion — number-before-case in optimal ordering (paper §4.3)
- §5: Pairwise informational fusion in Spanish (paper §4.4, Table 7)
- §6: Cross-framework engagement (Bybee Relevance, Mirror Principle, DM Fusion)
Numerics that are NOT Lean-internal #
The paper's empirical results are computed by an external LSTM seq2seq learner
on UniMorph 3.0 paradigm data (@cite{mccarthy-kirov-2020},
@cite{sylak-glassman-2015}). Specific φ₂ values, Pareto AUC values, and
permutation-test p-values cannot be re-derived inside Lean — the LSTM is the
authoritative source. All such values below are flagged -- UNVERIFIED: per
CLAUDE.md hallucination-prevention rules; they are recorded here for cross-study reference and human verification against the published figures.
Substantive concerns flagged #
The number-vs-case suppletion result (§4 below, 15/17 languages confirm the
ETH prediction) glosses over a known counterexample: Russian čelovek/ljudi
('person') coexists with case-and-number suppletion god/let ('year') where
the GEN.PL form is suppletive in both number AND case. @cite{moskal-2015}
advances the asymmetry as a structural universal under a phase-based
locality account; @cite{rathi-hahn-futrell-2026} treat it as a frequency-based regularity averaged over the typological sample. These are different
claims about the same data; this study file follows the paper but flags the
distinction.
Tables 4 (L_agg vs L_fus) and 6 (L_clustered vs L_nonclustered) #
Each toy "language" maps two binary input features (X₁ ∈ {ACTIVE, PASSIVE},
X₂ ∈ {PRESENT, PAST}) to a 2-character string Y₁Y₂ ∈ {A,B,C,D}². The four
(X₁, X₂) combinations have frequencies 3/8, 1/8, 1/8, 3/8 — chosen so that
mutual information I[X₁; X₂] > 0 while marginals are uniform.
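A quick arithmetic check of that claim (standard computation, not quoted from the paper): each marginal is $(\tfrac12, \tfrac12)$, so

$$I[X_1;X_2] \;=\; 2\cdot\tfrac38\log_2\!\frac{3/8}{1/4} \;+\; 2\cdot\tfrac18\log_2\!\frac{1/8}{1/4} \;=\; \tfrac34\log_2\tfrac32 - \tfrac14 \;\approx\; 0.19\ \text{bits} \;>\; 0.$$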
We encode each (X₁, X₂) combination as a separate InflectionClass 2 (a
two-cell paradigm: cell 0 is the first character, cell 1 is the second).
Frequencies attach to inflection classes, recovering the table-4 weights.
Paper Table 4:

| Probability | (X₁, X₂) | L_agg form | L_fus form |
| --- | --- | --- | --- |
| 3/8 | ACTIVE, PRESENT | AC | AC |
| 1/8 | ACTIVE, PAST | AD | AD |
| 1/8 | PASSIVE, PRESENT | BC | BD |
| 3/8 | PASSIVE, PAST | BD | BC |
L_agg: each character independently encodes one feature (Y₁=A iff X₁=ACTIVE, Y₂=C iff X₂=PRESENT). L_fus: Y₂ is the XOR of X₁ and X₂ (Controlled-NOT).
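A minimal executable sketch of the two mappings (throwaway names, not the study file's definitions; Bool stands in for the binary features):

```lean
-- Hypothetical mini-encoding of Table 4. `false`/`true` stand in for
-- ACTIVE/PASSIVE on the first feature and PRESENT/PAST on the second.
def aggForm (voice tense : Bool) : String :=
  (if voice then "B" else "A") ++ (if tense then "D" else "C")

def fusForm (voice tense : Bool) : String :=
  -- the second character realizes the XOR of the two features (Controlled-NOT)
  (if voice then "B" else "A") ++ (if Bool.xor voice tense then "D" else "C")

#eval aggForm true false  -- "BC" : PASSIVE, PRESENT in L_agg
#eval fusForm true false  -- "BD" : PASSIVE, PRESENT in L_fus
```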
L_agg, the agglutinative toy language. n=2 cells (first character, second character); 4 inflection classes (one per (X₁, X₂) combination) with frequencies 3/8, 1/8, 1/8, 3/8.
L_fus, the fusional toy language. Same paradigm shape as L_agg but the second character of each form depends on BOTH input features (XOR).
E-complexity (@cite{ackerman-malouf-2013}): both toy languages have
4 inflection classes.
The cell-1 distribution of L_agg has support of exactly two forms ("C", "D"),
which means Y₂ takes both values — supporting H[Y₂_agg] = log 2 (paper
Appendix A1 step 1). The numerical entropy comparison itself goes through
PMF.conditionalEntropy_le_entropy; the structural witness
is here. The actual list (with kernel-evaluated rationals) is:
[("C", 1/2), ("D", 1/2)] — the marginal is uniform.
The cell-1 distribution of L_fus also has support of two forms — but
with concentrated weights [("C", 3/4), ("D", 1/4)] (kernel-evaluated).
The concentration is the structural condition for H[Y₂_fus] < log 2.
Both toy languages have the same number of cell-0 distinct realizations (two: "A" and "B"). Fusion in L_fus does not affect cell 0.
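The marginals can be recomputed mechanically from the Table 4 rows; a self-contained sketch (our names, with weights kept as numerators over 8):

```lean
-- Recompute the per-slot marginals from the weighted Table 4 forms.
def aggForms : List (String × Nat) := [("AC", 3), ("AD", 1), ("BC", 1), ("BD", 3)]
def fusForms : List (String × Nat) := [("AC", 3), ("AD", 1), ("BD", 1), ("BC", 3)]

def cellMarginal (i : String.Pos) (forms : List (String × Nat)) : List (Char × Nat) :=
  ((forms.map fun (s, _) => s.get i).eraseDups).map fun c =>
    (c, (forms.filter fun (s, _) => s.get i == c).foldl (fun acc (_, w) => acc + w) 0)

#eval cellMarginal ⟨1⟩ aggForms  -- [('C', 4), ('D', 4)] : uniform 4/8, 4/8
#eval cellMarginal ⟨1⟩ fusForms  -- [('C', 6), ('D', 2)] : concentrated 6/8, 2/8
```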
Table 6: L_clustered vs L_nonclustered (paper §3.3) #
Two binary input features X₁, X₂; forms are again two-character strings over {A,B,C,D}. L_clustered places voice (X₁) before tense (X₂) consistently; L_nonclustered alternates the slot order based on the voice value.
L_clustered: voice always realized in slot 0, tense always in slot 1.
L_nonclustered: voice morpheme in slot 0 for ACTIVE; in slot 1 for PASSIVE. The same two morpheme inventories occupy different slots depending on the value of X₁.
L_clustered and L_nonclustered have the same E-complexity.
Paper Appendix A1: fusion can lower local surprisal #
The abstract theorem fusion_can_lower_marginal_entropy lives in
Theories/Processing/MemorySurprisal/InformationalFusion.lean and is a
direct re-export of PMF.conditionalEntropy_le_entropy
(Cover-Thomas 2.6.4).
The instantiation here exhibits L_fus and L_agg as concrete witnesses:
both have the same first-character distribution (L_agg_L_fus_cell0_match),
but L_fus's second-character distribution is strictly more concentrated
(6/8 C vs 4/8 C — see L_fus_cell1_distribution), so its marginal Shannon
entropy is strictly lower. The conclusion follows from conditionalEntropy_le_entropy
applied to the joint (X₁, X₂)-distribution implicit in the toy paradigm
construction. The numerical strict inequality H[Y₂_fus] < H[Y₂_agg] requires
Real.log arithmetic and lives in the abstract substrate; the structural
witnesses (the distributions) live here.
L_fus's cell-1 distribution has 6/8 mass on a single form (C), making it strictly more concentrated than L_agg's uniform 4/8, 4/8 split.
Direct from the explicit lists in L_fus_cell1_distribution and
L_agg_cell1_distribution. The substrate's
fusion_can_lower_marginal_entropy (Cover-Thomas 2.6.4) applies to
the underlying joint distribution.
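For concreteness, the strict inequality in bits (standard arithmetic over the two marginals):

$$H[Y_2^{\mathrm{fus}}] = -\tfrac34\log_2\tfrac34 - \tfrac14\log_2\tfrac14 \approx 0.811 \;<\; 1 = H[Y_2^{\mathrm{agg}}].$$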
Paper Appendix A2: fusion of independent features increases remote uncertainty #
The abstract theorem agglutination_lowers_remote_uncertainty lives in
InformationalFusion.lean and exposes mutualInformation_nonneg
(Cover-Thomas 2.6.5) as the building block. The paper's Table 5 gives a
3-feature toy with I[X₁; X₂] = 0 and I[X₂; X₃] > 0; we omit the
3-feature instantiation here as it requires a ParadigmSystem 3 with 8
inflection classes — the structural pattern is identical to A1, with the
substrate theorem doing the work.
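The identity behind the substrate theorem is the standard decomposition of conditional entropy:

$$H[X_3 \mid X_2] \;=\; H[X_3] - I[X_2;X_3] \;\le\; H[X_3],$$

with strict inequality exactly when $I[X_2;X_3] > 0$, which is the Table 5 configuration for the correlated feature pair.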
Paper Appendix A3: category clustering lowers local surprisal #
The structural witness: in L_clustered, knowing the slot-0 morpheme uniquely
predicts which feature it expresses (always voice). In L_nonclustered, the
same morpheme inventory {A,B,D} appears at slot 0 across multiple X₁ values.
The slot-0 distribution of L_nonclustered has 3 distinct values (A, B, D)
versus L_clustered's 2 (A, B); higher cardinality → potentially higher
entropy.
L_clustered's slot-0 distribution has support ≤ 2 forms (voice is always realized in slot 0; only A and B appear).
L_nonclustered's slot-0 distribution has support 3 forms (A, B, D — voice morpheme has been "displaced" into slot 0 for some entries). Higher cardinality is the structural condition for higher entropy (paper §3.3).
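The cardinality bound that makes "potentially higher" precise is $H[X] \le \log_2 |\mathrm{supp}(X)|$:

$$H[Y_0^{\mathrm{clustered}}] \le \log_2 2 = 1 \text{ bit}, \qquad H[Y_0^{\mathrm{nonclustered}}] \le \log_2 3 \approx 1.585 \text{ bits}.$$

The bound alone does not force the strict inequality; that depends on the frequency weights, which is why the docstrings above call support size the structural condition rather than the proof.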
Polyexponence — features that cluster in optimal ordering #
The paper computes optimal feature orderings for ~20 languages by minimizing the memory-surprisal AUC and asks: do features that are commonly polyexponent (person/number, case/number, TAM, PNG) cluster close together in the resulting orderings?
Result (paper §4.2): yes. For each polyexponent feature pair/triple, the average normalized separation in optimal orderings is significantly lower than the random baseline. All four feature combinations show p < 0.001 by one-sample t-test.
Specific normalized-separation values from paper Figs 7–8 are NOT quantified in the paper prose; only mean diamonds and 95% CIs are visible in the violin plots. We record the qualitative result below, together with a per-group sample-size table.
A polyexponent-feature group from paper §4.2: which Bybee categories participate, the language sample size, and the significance verdict.
categories carries the typed Bybee categorization rather than a
display string, so cross-framework cross-checks (e.g.,
polyexponent_categories_in_core_inflectional_range below) can engage
the substrate's MorphCategory.peripherality directly. Display names
(PNG, TAM, person/number, case/number) are recorded inline at the
constructor sites.
- categories : List Bybee1985.BybeeCategory
The Bybee categories participating in this polyexponent group. Some paper-described features (notably case) have no Bybee primitive because Bybee 1985 surveys only verbal morphology; case is omitted from the typed list and noted in the constructor comment.
- numLanguages : ℕ
Number of UD treebank languages contributing data points.
- significantlyClustered : Bool
Whether the average separation is significantly below random (p < 0.001).
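A hypothetical construction showing the intended shape (the BybeeCategory constructor names below are assumed from the agreement-subtype note in §6, and the language count is a placeholder; the real data table lives in the study file):

```lean
-- HYPOTHETICAL sketch: constructor names and the count are assumptions.
def pngGroup : PolyexponentGroup :=
  { categories := [.personAgr, .numberAgr, .genderAgr]  -- the "PNG" group
  , numLanguages := 20                                  -- placeholder count
  , significantlyClustered := true }                    -- p < 0.001 (paper §4.2)
```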
All four feature groups in the paper's sample show significant clustering.
Suppletion as fusion of root with feature #
The paper frames suppletion as fusion of the root with the suppletion-triggering grammatical feature. @cite{veselinova-2013} (WALS Ch 79) and @cite{moskal-2015} document that nominal suppletion is more commonly triggered by number than by case. The paper's prediction: in the memory-surprisal optimal ordering of features, number should be closer to the root than case across languages.
Result (paper §4.3, p. 18): 15 of 17 languages confirm number-before-case in optimal ordering. Two exceptions: Russian and Urdu.
The two exceptions are non-trivial. Russian čelovek/ljudi coexists with the case-and-number suppletion god/let (GEN.PL). @cite{moskal-2015} treats nominal-suppletion locality as a structural universal with a phase-based account; the paper treats it as a tendency derivable from average-case frequency reasoning. These are different theoretical claims about the same data.
Per-language suppletion-relevant ordering data from paper §4.3. The boolean records whether number is closer to the root than case in the language's optimal ordering.
- name : String
- family : String
- numberBeforeCase : Bool
True iff number-before-case in the memory-surprisal optimal ordering (the prediction the paper claims to verify).
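A hypothetical entry illustrating the record shape (SuppletionLanguage is our placeholder for the structure's name; the Russian verdict is the paper's, per §4.3):

```lean
-- HYPOTHETICAL: `SuppletionLanguage` names the structure documented above.
def russianDatum : SuppletionLanguage :=
  { name := "Russian"
  , family := "Indo-European"     -- placeholder family label
  , numberBeforeCase := false }   -- one of the two §4.3 counterexamples
```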
Russian and Urdu are the only languages in the sample where the paper's
number-before-case prediction fails. This is the substantive structural
claim — it survives sample growth as long as Russian and Urdu remain
the only counterexamples. The "15 of 17" docstring count is by
construction; per linglib's feedback_no_aggregate_count_theorems.md,
aggregate-count theorems go stale on data revision and are omitted here.
Spanish amar paradigm (paper Table 7) #
The paper estimates pairwise informational fusion φ₂(f₁, f₂) for Spanish
verbal feature pairs using an LSTM seq2seq learner trained on UniMorph 3.0.
Two specific values are cited in the prose:
- φ₂(IMPF, SG) ≈ 2.71 bits (low; predictable from regular morphology)
- φ₂(1, PL) ≈ 46.08 bits (very high; -mos suffix unpredictable from rest)
The values are produced by the LSTM, not derivable inside Lean. We record
them as documented constants with -- UNVERIFIED: flags.
The qualitative claim — feature pairs with high φ₂ are CLOSE in optimal ordering — is verified empirically by the paper's Pareto-frontier permutation test (Fig 10), p < 0.05 for all four languages tested (Arabic, Latin, Portuguese, Spanish).
φ₂(IMPF, SG) for Spanish amar ≈ 2.71 bits — low: regular morphology
(ama + ba + s decomposes cleanly). Stored × 100.
φ₂(1, PL) for Spanish amar ≈ 46.08 bits — very high: the -mos suffix
cannot be predicted from any subset of (1st-person, plural) features
when both are held out from the learner. Stored × 100.
The (1, PL) feature pair has substantially higher informational fusion
than (IMPF, SG) — the paper's qualitative observation that the
-mos suffix cannot be predicted from the rest of the paradigm.
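A sketch of the ×100 storage convention (identifiers are illustrative; the numeric values are the paper's LSTM outputs, hence unverifiable here):

```lean
-- UNVERIFIED: values produced by the external LSTM learner (paper Table 7).
def phi2_impf_sg_x100 : Nat := 271    -- φ₂(IMPF, SG) ≈ 2.71 bits, × 100
def phi2_first_pl_x100 : Nat := 4608  -- φ₂(1, PL)    ≈ 46.08 bits, × 100

-- The qualitative comparison stated above survives as pure Nat arithmetic.
example : phi2_impf_sg_x100 < phi2_first_pl_x100 := by decide
```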
Bybee Relevance ↔ informational fusion #
Paper §5.2 claims to operationalize Bybee 1985's "relevance" as mutual information: features more relevant to the stem cluster closer to it, and fuse with it, because of their high mutual information.
This is a substantive reframing, not a translation. Bybee's "relevance" is defined in terms of semantic effect on stem denotation (valence > voice > aspect > tense > mood > agreement; @cite{bybee-1985} Ch 2 §3 p. 20). Mutual information is a usage-statistic. These coincide to the extent that semantic relevance drives co-occurrence regularity, but they are distinct constructs and can in principle diverge.
The substrate's Core.Morphology.MorphCategory.peripherality numerically
encodes Bybee's hierarchy as constants (MorphRule.lean:264-276). The
bridge Bybee1985.toMorphCategory : BybeeCategory → MorphCategory
(Bybee1985.lean:248-257) connects the paper-typed enum to the substrate.
@cite{rathi-hahn-futrell-2026}'s reframing makes those constants
potentially derivable from MI on a large multilingual corpus.
The substrate retains the Bybee primitive (option (a) in the cross-framework audit); the cross-check theorem below tracks polyexponentGroups
data through toMorphCategory ∘ peripherality, so editing the data table
drives the theorem (strong test: adding a feature group containing
nonfinite (rank 9) breaks the theorem).
Polyexponent feature groups stay within the core inflectional band:
every category appearing in polyexponentGroups has substrate
peripherality ≤ 8 (the agreement rank). None involves the linglib-extension categories: nonfinite (rank 9), the lexical-derivational
derivation (rank 1), or the adjectival degree (rank 5).
The theorem engages the substrate primitives directly via
toMorphCategory ∘ peripherality, so:
- Strong test passes: adding a category with peripherality > 8 to any polyexponentGroups entry breaks compilation.
- No silent disagreement: the rank values come from MorphCategory.peripherality, not a duplicate local table.
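The statement shape, as a sketch (the study file's theorem is named polyexponent_categories_in_core_inflectional_range; closing with decide assumes Decidable instances over the concrete data table, which holds for closed enumerations):

```lean
-- Sketch of the cross-check over the concrete data table.
example : ∀ g ∈ polyexponentGroups, ∀ c ∈ g.categories,
    (Bybee1985.toMorphCategory c).peripherality ≤ 8 := by
  decide
```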
Note: in the substrate, all four agreement subtypes
(personAgr/numberAgr/genderAgr/personAgrObj) collapse to
MorphCategory.agreement _ at rank 8. This is faithful to Bybee
(Ch 2 §3 collapses agreement subtypes; only object-person is reported
separately, and number-as-agreement vs nominal number is a
separate distinction the substrate flags).
Mirror Principle ↔ ETH #
@cite{baker-1985}'s Mirror Principle holds that morpheme order reflects syntactic-derivation order. @cite{rathi-hahn-futrell-2026} §5.1 argues ETH makes compatible predictions for affix order, but for a different reason: ETH derives ordering from on-line processing efficiency rather than from underlying syntactic structure.
A formal bridge mirror_compatible_with_information_locality would need
Phenomena/Morphology/Studies/Baker1985.lean to import the memory-surprisal
substrate. This is out of scope for the current study file; the existing
Baker 1985 study should eventually carry such a bridge.
DM Fusion ↔ informational fusion #
@cite{halle-marantz-1993} treat fusion as postsyntactic merger of adjacent
terminals (Theories/Morphology/DM/PostSyntacticOps.lean::FusionRule).
@cite{rathi-hahn-futrell-2026}'s informational fusion is a usage-statistic
on surface paradigm forms. The two are different mathematical objects with
overlapping names; a formal bridge dm_fusion_implies_high_mi is out of
scope but would tie the substrate together.
Ackerman-Malouf 2013 LCEC ↔ informational fusion #
@cite{ackerman-malouf-2013}'s i-complexity is (1/n(n-1)) Σᵢ≠ⱼ H(Cᵢ|Cⱼ),
the average pairwise conditional entropy across cells of the same paradigm
system. @cite{rathi-hahn-futrell-2026}'s pairwise informational fusion
φ₂(f₁,f₂) is closely related — both measure paradigm-internal
predictability. With Phase 3 of the substrate restructure, both consumers
share Core.Morphology.ParadigmSystem and route through
PMF.conditionalEntropy. A formal bridge
iComplexity ↔ MutualInfoProfile.totalInfo is left for future work; the
shared substrate makes such a bridge syntactically straightforward.
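For reference, the Ackerman-Malouf quantity in display form (the same formula as the inline one above):

$$\mathrm{iComplexity} \;=\; \frac{1}{n(n-1)} \sum_{i \neq j} H\!\left(C_i \mid C_j\right),$$

where $C_1, \dots, C_n$ are the cells of a ParadigmSystem; each summand is a PMF.conditionalEntropy, which is what makes the future bridge to MutualInfoProfile.totalInfo syntactically straightforward.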