Linglib.Theories.Processing.MemorySurprisal.InformationalFusion

Informational Fusion #

@cite{rathi-hahn-futrell-2021} @cite{rathi-hahn-futrell-2026}

Sibling of MemorySurprisal/Basic.lean. Defines the informational fusion measure introduced in @cite{rathi-hahn-futrell-2021} and used as the central empirical estimator in @cite{rathi-hahn-futrell-2026}:

φ(w | σ, ℓ, L_{-σ}) := -log p(w | σ, ℓ, L_{-σ})

The surprisal of surface form w for feature set σ of inflection class ℓ, holding out all instances of σ from the learner's training corpus L (yielding L_{-σ}). A form is informationally fused when it cannot be predicted from any subset of its features — high φ.
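For a concrete magnitude (illustrative arithmetic, not a value from the paper): with mathlib's natural-log Real.log, a learner that assigns p = 1/4 to the attested form gives

φ = -log(1/4) = log 4 ≈ 1.386 nats (≈ 2 bits),

while a fully predictable form (p = 1) gives φ = 0.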

Pairwise informational fusion #

@cite{rathi-hahn-futrell-2026} eq. (12) generalizes φ to pairs of features:

φ₂(w | (f₁, f₂), ℓ, L_{-(f₁,f₂)}) := -log p(w | (f₁, f₂), ℓ, L_{-(f₁,f₂)})

Same shape, with the held-out subset being a feature pair rather than a single feature. Used in §4.4 of the paper to show that pairs of features with high informational fusion are close in optimal-ordering rankings.

Learner model abstraction #

The substrate is agnostic about the learner's implementation: @cite{rathi-hahn-futrell-2021} use an LSTM seq2seq model with attention; in principle any learner that produces a probability over candidate forms is admissible. LearnerModel n carries the prediction function as a field; the held-out corpus and target cell are explicit parameters. Specific learners (LSTM, n-gram, hand-coded morphological process libraries, ...) are out of Lean scope — they live in the paper's Python codebase.
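A minimal sketch of the shape this abstraction takes, assuming only the prose above (the name `LearnerModel'` and the `Corpus` parameter are placeholders; the actual Linglib definition fixes the corpus type to `Core.Morphology.ParadigmSystem n Form` and may carry more fields):

```lean
import Mathlib.Data.Real.Basic

-- Sketch only: `Corpus` abstracts `Core.Morphology.ParadigmSystem n Form`.
-- The one essential field is the prediction function, matching the call
-- shape `M.predict corpus σ w` used throughout this file.
structure LearnerModel' (n : ℕ) (Corpus Form : Type) where
  /-- Probability assigned to candidate form `w` at cell `σ`,
  given a (held-out) training corpus. -/
  predict : Corpus → Fin n → Form → ℝ
```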

Three formal arguments (paper Appendix A1, A2, A3) #

The paper proves three pure information-theoretic claims:

A1. Fusion of correlated features can lower local surprisal (Appendix A1).
A2. Fusion of independent features cannot lower long-range surprisal (Appendix A2).
A3. Category clustering lowers local surprisal (Appendix A3).

This file states A1/A2/A3 as abstract theorems on entropy parameterized over distributions on Fin n-typed random variables. The Rathi 2026 study file (Phenomena/Morphology/Studies/RathiHahnFutrell2026.lean) instantiates them on the paper's toy languages L_agg, L_fus, L_clustered with concrete finite distributions, and proves the consequences computationally (decide on rationals, no native_decide).

The substrate proofs use Core.InformationTheory.conditionalEntropy_le_entropy (Cover-Thomas 2.6.4) and mutualInformation_nonneg (Cover-Thomas 2.6.5).

A learner model: predicts probability of a candidate form at a cell index given a training corpus.

Parameterized over Form (the surface-form alphabet). Most consumers instantiate Form := String (matching ParadigmSystem n String); the Rathi 2026 toy paradigms use a small [Fintype] form alphabet so that PMF-based entropy operators apply. The LearnerModel n Form type does not require [Fintype Form]; consumers add it where needed.

The corpus argument is the held-out corpus: it is the user's responsibility to remove the target cell (or feature pair) before invoking predict. This keeps the substrate agnostic about how the holdout is computed (per-cell, per-feature-pair, etc.).

The predicted probability is a real number in [0, 1] summing to 1 over candidate forms — but no invariant is enforced at the type level. Pathological learners can violate it, in which case informationalFusion silently returns 0 for predict = 0 (per mathlib's Real.log 0 = 0 convention) instead of +∞. The PMF-returning variant (LearnerModel.toPMF, deferred to a follow-up phase that supplies the requisite normalization proofs) closes this hole.
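The junk-value behaviour is inherited directly from mathlib; a sketch of the check:

```lean
import Mathlib.Analysis.SpecialFunctions.Log.Basic

-- mathlib's junk-value convention: `Real.log 0 = 0`, so a learner that
-- assigns probability 0 produces fusion 0 rather than +∞.
example : -Real.log 0 = 0 := by simp
```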

Instances For
    noncomputable def Processing.MemorySurprisal.InformationalFusion.informationalFusion {n : ℕ} {Form : Type u_1} (M : LearnerModel n Form) (corpus : Core.Morphology.ParadigmSystem n Form) (σ : Fin n) (w : Form) : ℝ

    Informational fusion (@cite{rathi-hahn-futrell-2026} eq. 4): φ(w | σ, ℓ, L_{-σ}) := -log p(w | σ, ℓ, L_{-σ}).

    The surprisal of form w at cell σ under the learner M trained on the held-out corpus corpus (which the consumer constructs by removing σ-cells from the full paradigm). High φ means the form cannot be predicted from any subset of its other features.

    Equations
    Instances For
      theorem Processing.MemorySurprisal.InformationalFusion.informationalFusion_nonneg {n : ℕ} {Form : Type u_1} (M : LearnerModel n Form) (corpus : Core.Morphology.ParadigmSystem n Form) (σ : Fin n) (w : Form) (h_nonneg : 0 ≤ M.predict corpus σ w) (h_le_one : M.predict corpus σ w ≤ 1) :
      0 ≤ informationalFusion M corpus σ w

      For any learner whose predict returns a value in [0, 1], informationalFusion (i.e., -log predict) is non-negative. Standard fact about the surprisal of a probability — included here as proof-of-concept that the substrate's LearnerModel API supports structural reasoning. Consumers that supply non-[0,1] predict values can construct pathological cases (the fusion silently evaluates to 0 instead of +∞ when predict = 0); see the LearnerModel docstring caveat.
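The underlying mathlib fact is `Real.log_nonpos` (log x ≤ 0 for 0 ≤ x ≤ 1); stripped of the learner wrapper, the theorem is essentially:

```lean
import Mathlib.Analysis.SpecialFunctions.Log.Basic

-- Surprisal of a probability is non-negative.  The `p = 0` edge case is
-- absorbed by `Real.log 0 = 0`, which is why `0 ≤ p` (not `0 < p`) suffices.
example (p : ℝ) (h₀ : 0 ≤ p) (h₁ : p ≤ 1) : 0 ≤ -Real.log p :=
  neg_nonneg.mpr (Real.log_nonpos h₀ h₁)
```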

      Pairwise informational fusion (@cite{rathi-hahn-futrell-2026} eq. 12) #

      The pairwise generalization of informationalFusion differs from the single-feature case only in how the held-out corpus is constructed: both cells of the pair are removed before the learner is trained.

      Per the substrate's design (the consumer constructs the held-out corpus and passes it to LearnerModel.predict), pairwise fusion does not need its own function. To compute φ₂ for a feature pair (σ, τ):

      informationalFusion M (heldOutBoth corpus σ τ) σ w
      

      where heldOutBoth : ParadigmSystem n Form → Fin n → Fin n → ParadigmSystem n Form is a paradigm-restriction helper provided by the consumer (typically by masking entries of both cells). The earlier substrate version of this file exposed pairwiseFusion as a separate def, but the body was identical to informationalFusion (the τ parameter was unused), so it has been removed; future readers should use the recipe above.
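A hypothetical consumer-side helper with the stated signature, under the simplifying assumption that a paradigm can be viewed as a map from cells to attested-form lists (the real ParadigmSystem representation may differ):

```lean
-- Sketch only: `Fin n → List Form` stands in for `ParadigmSystem n Form`,
-- and "masking" is modelled as emptying both held-out cells.
def heldOutBoth' {n : ℕ} {Form : Type}
    (corpus : Fin n → List Form) (σ τ : Fin n) : Fin n → List Form :=
  fun i => if i = σ ∨ i = τ then [] else corpus i
```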

      noncomputable def Processing.MemorySurprisal.InformationalFusion.averageInformationalFusion {n : ℕ} {Form : Type} [BEq Form] (M : LearnerModel n Form) (corpus : Core.Morphology.ParadigmSystem n Form) (σ : Fin n) : ℝ

      Average informational fusion across all forms expressing feature set σ.

      Equations
      • One or more equations did not get rendered due to their size.
      Instances For

        Empirical-frequency learner #

        The simplest non-trivial LearnerModel: predict probability of form w at cell σ by its empirical frequency in the corpus's cell distribution at σ. Formally:

        `predict corpus σ w = freq(w in corpus.cellDistribution σ) / total(...)`
        

        This is the "memorizer" baseline — it just reads off observed frequencies without any generalization. @cite{ackerman-malouf-2013}'s i-complexity implicitly uses this learner: H(C_i | C_j) is the conditional entropy under empirical distributions. The bridge between A&M's i-complexity and Rathi's informational fusion is precisely "average of informationalFusion empiricalLearner ... across cell pairs," but we leave the full bridge theorem for future work.

        The empirical learner returns 0 when the cell is empty or when w is not attested at the cell — this is the "totally surprising" case (informationalFusion empiricalLearner ... w = -log 0 = 0 per mathlib's Real.log convention; pathological behaviour, see API caveat above).
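The empirical fraction itself is elementary; a sketch over a flat list view of one cell (hypothetical helper; the actual cellDistribution API may differ):

```lean
import Mathlib.Data.Rat.Defs

-- Relative frequency of `w` among the attested forms of one cell, in ℚ:
-- 0 for an empty cell or an unattested form, matching the prose above.
def empiricalFraction {Form : Type} [BEq Form]
    (cell : List Form) (w : Form) : ℚ :=
  if cell.isEmpty then 0
  else (cell.count w : ℚ) / (cell.length : ℚ)
```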

        noncomputable def Processing.MemorySurprisal.InformationalFusion.empiricalLearner {n : ℕ} {Form : Type} [BEq Form] : LearnerModel n Form

        Empirical-frequency learner: predict by relative frequency in the cell distribution. The underlying fraction involves no Real.log and is computed in ℚ; the def is noncomputable only through the ℚ-to-ℝ coercion of that fraction. Returns 0 when the form is unattested or the cell is empty.

        This is the smallest non-trivial witness that LearnerModel has at least one inhabitant; the full validity proof (probabilities sum to 1 and are non-negative) requires cellDistribution_nonneg and cellDistribution_sum_pos lemmas in Core/Morphology/Paradigm.lean, flagged as future work in the substrate restructure. The empirical learner's behaviour on the toy paradigms L_agg and L_fus is discussed in the Rathi 2026 study file.

        Equations
        • One or more equations did not get rendered due to their size.
        Instances For

          Paper Appendix A1: fusion can lower local surprisal #

          The paper's claim (Appendix A1): for binary features X₁, X₂ with joint distribution p, in an agglutinative mapping where Y₂ depends only on X₂, we have H[Y₂] = H[X₂]. In a fusional mapping where Y₂ depends on both X₁ and X₂, we have H[Y₂] ≤ H[X₂] whenever conditioning on X₁ reduces the entropy of (the part of Y₂ derived from) X₂.

          The Lean abstract statement: for any joint distribution and any function f : α → β with finite support, the entropy of the marginal of f(X) is bounded by the entropy of X. This is Cover-Thomas Theorem 2.8.2 (data processing inequality for functions). It already follows from conditionalEntropy_le_entropy (Theorem 2.6.4) plus H[Y | X] = 0 when Y is a deterministic function of X.
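Spelled out, the reduction is the two-sided expansion of the joint entropy H[X, f(X)]:

H[X, f(X)] = H[X] + H[f(X) | X] = H[X]          (determinism: H[f(X) | X] = 0)
H[X, f(X)] = H[f(X)] + H[X | f(X)] ≥ H[f(X)]    (non-negativity of conditional entropy)

whence H[f(X)] ≤ H[X].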

          The study file specializes this to the paper's specific 4-element joint distribution on (X₁, X₂) with the L_agg vs L_fus mappings, computing H[Y₂] for each via decide on rationals.

          A1 (abstract): conditioning reduces entropy of any random variable. Cover-Thomas 2.6.4. This is the substrate version of the paper's Appendix A1 claim that fusion of correlated features can lower marginal entropy of the derived variable.

          The canonical statement lives at `PMF.conditionalEntropy_le_entropy` —
          `joint.conditionalEntropy ≤ joint.snd.entropy` for any joint
          PMF on `(α × β)` with strict-positive marginals. Consumers wanting the
          (ι→ℝ) form construct PMFs via `PMF.ofRealWeightFn` then apply.
          
          This was previously a re-export of `Core.InformationTheory.conditionalEntropy_le_entropy`;
          the PMF migration deleted that Core lemma, and the re-export wrapper went with it.
          

          A2 (abstract scaffold): fusion of independent features cannot lower long-range surprisal.

          The canonical statement lives at `PMF.mutualInformation_nonneg` (free from
          `ENNReal.toReal_nonneg` since `PMF.mutualInformation := (klDiv ...).toReal`).
          The Rathi 2026 study file uses `decide` on the paper's specific 8-element
          joint distribution to verify the inequality computationally. 
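The inequality being verified is Gibbs' inequality in mutual-information form:

I[X; Y] = D_KL( p(x, y) ‖ p(x) p(y) ) ≥ 0,

with equality exactly when X and Y are independent, i.e. the boundary case in which fusion brings no long-range benefit.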
          

          Paper Appendix A3: category clustering lowers local surprisal #

          The paper's claim (§3.3, Appendix A3): in a 2-feature paradigm with mutually-exclusive feature values realized in consistent slot positions (category clustering), the marginal entropy of the slot-position random variable is lower than in a non-clustered paradigm. The proof is a direct comparison of two explicit 4-element distributions.

          The Lean abstract statement reduces to: entropy s p ≤ entropy s q when p is "more concentrated" than q in the appropriate sense (Schur concavity of entropy). For the paper's specific 4-state distribution this is verifiable by decide and lives in the study file.
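An illustrative pair of 4-element distributions (not the paper's): (1/2, 1/2, 0, 0) majorizes (1/4, 1/4, 1/4, 1/4), so Schur concavity gives

H(1/2, 1/2, 0, 0) = log 2 < log 4 = H(1/4, 1/4, 1/4, 1/4),

i.e. the more concentrated ("clustered") distribution has strictly lower entropy.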