Documentation

Linglib.Theories.Processing.MemorySurprisal.Basic

Memory-Surprisal Trade-off Framework #

@cite{futrell-2019} @cite{hahn-degen-futrell-2021} @cite{zaslavsky-hu-levy-2020}

Core formalization of @cite{hahn-degen-futrell-2021} "Modeling Word and Morpheme Order as an Efficient Trade-Off of Memory and Surprisal", Psychological Review 128(4):726–756.

Key Idea #

Bounded memory forces a trade-off between memory usage (H_M) and processing difficulty (surprisal S_M). At each time step, the processor stores a lossy encoding of the past. Better encoding reduces surprisal but costs more memory. The optimal trade-off forms a convex curve in (H_M, S_M) space.

Information Locality #

The curve's shape is determined by the mutual information profile I_t: how much mutual information exists between the current word and the word t steps back. Languages that concentrate predictive information locally (information locality) achieve steeper, more efficient trade-off curves.

Mathematical Structure #

The central mathematical insight is the marginal rate of substitution: at distance t, each bit of surprisal reduction costs exactly (t+1) bits of memory. This is why information locality matters — short-distance information is cheap (1 bit of memory per bit of surprisal at t=0), while long-distance information is expensive ((t+1) bits per bit at distance t). The increasing marginal rate makes the bound curve convex.
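The arithmetic can be sketched with a toy profile. All names and values below are illustrative, not the library's own definitions:

```lean
-- Hypothetical I_t × 1000 values for distances t = 0, 1, 2.
def toyVals : List Nat := [300, 200, 100]

-- Memory cost of capacity T: Σ_{t<T} (t+1)·I_t  (× 1000).
def toyMemCost (T : Nat) : Nat :=
  ((List.range T).map fun t => (t + 1) * toyVals.getD t 0).foldl (· + ·) 0

-- Surplus surprisal at capacity T: Σ_{t≥T} I_t  (× 1000).
def toySurplus (T : Nat) : Nat :=
  (toyVals.drop T).foldl (· + ·) 0

#eval toyMemCost 1  -- 300: one bit of memory per bit of surprisal at t = 0
#eval toyMemCost 2  -- 700: the step from T = 1 to T = 2 costs 2·200 = 400
#eval toySurplus 1  -- 300: yet that step reduces surplus by only 200 millibits
```

The step from capacity 1 to 2 pays 400 millibits of memory for 200 millibits of surprisal reduction, a ratio of 2 = T + 1, matching the marginal rate.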

§3 proves the marginal rate theorem and its consequences from the definitions, without requiring measure theory. The bound itself (Theorem 1, §4) — that the I_t profile determines the achievable region of (H_M, S_M) pairs — is stated as comprehension postulates requiring the data processing inequality and chain rule for stationary processes.

Connection to DLM #

Information locality generalizes dependency length minimization: DLM minimizes the structural distance between syntactically related words, while information locality minimizes the information-theoretic distance at which predictive information concentrates.

Sections #

Memory Encoding #

A memory encoding maps the history of observed words to a finite memory state. At each time step t, the processor sees word w_t and updates its memory state m_t = encode(m_{t-1}, w_t). The memory's entropy H_M measures how much information the processor retains about the past.

A memory encoding compresses the past into a finite memory state.

W is the word type, Mem is the memory state type. The encoding function takes a memory state and a new word and returns the updated memory state.

  • encode : Mem → W → Mem

    Update memory state given a new word

  • initial : Mem

    Initial memory state (before any words are seen)

Instances For
    def Processing.MemorySurprisal.MemoryEncoding.summary {W Mem : Type} (me : MemoryEncoding W Mem) (history : List W) : Mem

    Iterate a MemoryEncoding over an entire history to produce the final memory state. This is the context-summary function that, when paired with a predictor Mem → PMF (Option W), induces a Dirac MemoryProcess (in Theories.Processing.Memory). Such a process is lossless for its own virtual LM (MemoryProcess.expectedSurprisal_eq_virtualLM_surprisal), so classical surprisal arises exactly when memory is encoded deterministically.

    Equations
    Instances For
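The shape of these definitions can be sketched with a local toy (the names here are illustrative; the fold direction of `summary` is an assumption based on the description above):

```lean
structure ToyEncoding (W Mem : Type) where
  encode : Mem → W → Mem   -- update memory given a new word
  initial : Mem            -- state before any words are seen

-- A maximally lossy instance: remember only the most recent word.
def lastWordOnly (W : Type) : ToyEncoding W (Option W) :=
  { encode := fun _ w => some w, initial := none }

-- Iterate the encoding over a whole history, as `summary` does.
def toySummary {W Mem : Type} (me : ToyEncoding W Mem) (history : List W) : Mem :=
  history.foldl me.encode me.initial

#eval toySummary (lastWordOnly String) ["the", "cat", "sat"]  -- some "sat"
```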

      Mutual Information Profile #

      The mutual information profile I_t measures how much information the word at position n provides about the word at position n + t (at distance t). This determines the shape of the memory-surprisal trade-off curve.

      Key insight: I_t = S_t - S_{t+1}, where S_t is the surprisal when the processor remembers only t steps of context. So I_t measures the marginal value of remembering one more step.

      The profile connects to PMF.mutualInformation: each I_t is I(W_n; W_{n+t}), the mutual information between words at distance t in the stationary process.
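The identity I_t = S_t − S_{t+1} telescopes, which is where the tail-sum formula for surplus surprisal (used below) comes from:

```latex
S_T - S_\infty \;=\; \sum_{t \ge T} (S_t - S_{t+1}) \;=\; \sum_{t \ge T} I_t
```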

      Mutual information at distance t between words.

      I_t represents how much mutual information exists between a word and the word t steps back. Higher I_t at small t means information is concentrated locally (good for memory-efficient processing).

      Stored as I_t × 1000 for decidable computation.

      • name : String

        Name for this profile (e.g., "English", "Japanese", "Baseline")

      • values : List ℕ

        I_t × 1000 for distances t = 1, 2, 3,...

      Instances For
        Equations
        • One or more equations did not get rendered due to their size.
        Instances For
          Equations
          • One or more equations did not get rendered due to their size.
          Instances For

            Sum of I_t values (total predictive information × 1000).

            Equations
            Instances For

              A profile is information-local with bound L if all predictive information lies within distance L, i.e., I_t = 0 for t ≥ L.

              This is the information-theoretic analog of a dependency-length bound: short structural dependencies make I_t decay rapidly, so a syntax with maximum dependency length L produces a profile that is local with that same bound.

              Equations
              Instances For

                Surplus surprisal when the processor remembers T steps of context: the predictive information lost beyond the memory window. Equals Σ_{t≥T} I_t (the tail sum of the profile).

                Equations
                Instances For
                  def Processing.MemorySurprisal.weightedPrefixSum : ℕ → List ℕ → ℕ

                  Weighted prefix sum: Σ_{i<|l|} (t₀+i+1)·l[i].

                  Non-accumulator formulation for easier proofs. Structurally recursive on the list, so properties follow directly by induction.

                  Equations
                  Instances For
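The structurally recursive formulation can be sketched as follows (a local toy, not the library's definition): each step consumes the list head with weight t₀ + 1 and shifts t₀ by one, so the weight grows with position.

```lean
-- Weighted prefix sum Σ_{i<|l|} (t₀+i+1)·l[i], recursing on the list.
def toyWPS (t₀ : Nat) : List Nat → Nat
  | [] => 0
  | x :: xs => (t₀ + 1) * x + toyWPS (t₀ + 1) xs

#eval toyWPS 0 [300, 200, 100]  -- 1·300 + 2·200 + 3·100 = 1000
```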

                    Memory cost of remembering T steps of context: Σ_{t=0}^{T-1} (t+1)·I_t. This is the prefix of the weighted sum up to distance T.

                    Equations
                    Instances For

                      Weighted sum Σ (t+1)·I_t (total memory cost × 1000).

                      This is the memory cost of storing the entire profile: memoryCost p p.values.length. Languages with lower weighted sums concentrate information at shorter distances.

                      Equations
                      Instances For

                        Marginal Analysis #

                        The trade-off bound maps memory capacity T to a point (memoryCost(T), surplusSurprisal(T)). As T increases by 1:

                        The ratio — (T+1) bits of memory per bit of surprisal reduction — is the marginal rate of substitution. It increases with T, which is why the bound curve is convex: short-distance information is cheap, long-distance information is expensive.

                        This is the mathematical core of the information locality argument. Languages that concentrate I_t at small t exploit the "cheap" region of the curve, achieving low surprisal without high memory cost.
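In symbols, the two step lemmas below combine into the marginal rate:

```latex
\frac{\mathrm{memoryCost}(T+1)-\mathrm{memoryCost}(T)}
     {\mathrm{surplusSurprisal}(T)-\mathrm{surplusSurprisal}(T+1)}
\;=\; \frac{(T+1)\,I_T}{I_T} \;=\; T+1
```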

                        theorem Processing.MemorySurprisal.surplus_step (p : MutualInfoProfile) (T : ℕ) (hT : T < p.values.length) :
                        p.surplusSurprisal T = p.values[T] + p.surplusSurprisal (T + 1)

                        Surplus step: stepping from capacity T to T+1 reduces surplus surprisal by exactly I_T.

                        surplusSurprisal(T) = I_T + surplusSurprisal(T+1)

                        theorem Processing.MemorySurprisal.memoryCost_step (p : MutualInfoProfile) (T : ℕ) (hT : T < p.values.length) :
                        p.memoryCost (T + 1) = p.memoryCost T + (T + 1) * p.values[T]

                        Memory cost step: stepping from capacity T to T+1 increases memory cost by exactly (T+1)·I_T.

                        memoryCost(T+1) = memoryCost(T) + (T+1)·I_T

                        theorem Processing.MemorySurprisal.marginal_rate (p : MutualInfoProfile) (T : ℕ) (hT : T < p.values.length) :
                        p.memoryCost (T + 1) - p.memoryCost T = (T + 1) * (p.surplusSurprisal T - p.surplusSurprisal (T + 1))

                        Marginal rate of substitution: at distance T, each bit of surprisal reduction costs exactly (T+1) bits of memory.

                        This is the deep content of the information locality argument. The marginal memory cost is (T+1) * I_T, and the marginal surprisal reduction is I_T, so the cost ratio is T+1.

                        Short-distance information (small T) is cheap: 1 bit of memory per bit of surprisal at T=0. Long-distance information is expensive: (T+1) bits of memory per bit of surprisal at distance T.

                        This increasing cost is why information locality matters: languages that concentrate I_t at small t exploit the cheap end of the curve.

                        Surplus at capacity 0 equals total information: when the processor remembers nothing, all predictive information is lost.

                        Surplus at full capacity is zero: when the processor remembers everything, no predictive information is lost.

                        Memory cost at capacity 0 is zero: remembering nothing costs nothing.

                        Memory cost at full capacity equals the weighted sum.

                        theorem Processing.MemorySurprisal.marginal_rate_increasing (T₁ T₂ : ℕ) (h : T₁ < T₂) :
                        T₁ + 1 < T₂ + 1

                        The marginal rate of substitution (T+1) is strictly increasing: each additional step of memory is more expensive per unit of surprisal reduction than the last. This makes the bound curve convex.

                        Information locality bounds memory cost.

                        For any information-local profile with bound L (i.e., I_t = 0 for t ≥ L), the memory-cost weighted sum is at most L · totalInfo — at most L bits of memory per bit of total predictive information.

                        This is the information-theoretic analog of dependency length minimization: when syntactic dependencies are short (capped at length L), the corresponding I_t profile is local with bound L, and the memory cost grows at most linearly in L. The memory-surprisal trade-off (@cite{hahn-degen-futrell-2021}) is thus strictly more general than DLM (@cite{futrell-mahowald-gibson-2015}): DLM optimizes the structural distance L; information locality optimizes the entire I_t profile, of which DLM is the bound-L special case.
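A toy comparison (illustrative values and names) of two profiles with equal total information, one information-local and one not, shows the bound at work:

```lean
def localProfile  : List Nat := [500, 500, 0, 0]  -- local with bound L = 2
def spreadProfile : List Nat := [0, 0, 500, 500]  -- same totalInfo, long distances

-- Σ (t+1)·I_t over the whole profile.
def toyWeightedSum (l : List Nat) : Nat :=
  ((List.range l.length).map fun t => (t + 1) * l.getD t 0).foldl (· + ·) 0

#eval toyWeightedSum localProfile   -- 1500 ≤ 2·1000 = L · totalInfo
#eval toyWeightedSum spreadProfile  -- 3500: the non-local profile pays more memory
```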

                        Information Locality Bound #

                        Theorem 1 of @cite{hahn-degen-futrell-2021} establishes that the mutual information profile I_t determines a lower bound on the achievable memory-surprisal trade-off. Specifically, for any memory encoding with capacity T:

                        If H_M ≤ Σ_{t≤T} t · I_t (the memory cost of storing T steps of context), then S_M ≥ S_∞ + Σ_{t>T} I_t (the surplus surprisal from forgetting the rest).

                        The proof requires three comprehension postulates about the underlying stationary ergodic process, which we cannot derive without measure theory:

                        1. Data Processing Inequality: Processing (lossy compression) cannot increase mutual information. If X → M → Y is a Markov chain, then I(X;Y) ≤ I(X;M). This bounds how much information the memory state can preserve about the past.

                        2. Chain Rule for Mutual Information: I(X; Y₁, Y₂) = I(X; Y₁) + I(X; Y₂|Y₁). This decomposes the total information into distance-specific contributions I_t, enabling the per-distance accounting that underlies the marginal rate theorem.

                        3. Memory Entropy Bound: H(M_t) ≥ I(M_t; W_{t-1}, ..., W_{t-T}). The memory must have enough entropy to carry the mutual information it preserves.

                        Given these postulates, the bound follows: the memory cost of storing T distances of context is Σ_{t≤T} t·I_t (from the chain rule applied iteratively), and the surplus surprisal from forgetting distances > T is Σ_{t>T} I_t (from the data processing inequality). The marginal analysis in §3 shows the consequences of this bound — the increasing cost ratio (T+1) — purely from the definitions, without needing the postulates.
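Putting the postulates together, the bound of Theorem 1 can be stated compactly in the paper's notation:

```latex
H_M \le \sum_{t \le T} t\, I_t
\;\Longrightarrow\;
S_M \ge S_\infty + \sum_{t > T} I_t
```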

                        TODO: Full derivation requires formalizing stationary ergodic processes, the data processing inequality, and the chain rule for mutual information. These are available in principle via Mathlib's measure theory, but the connection to discrete stochastic processes is substantial.

                        Comprehension postulates for the information locality bound.

                        Witnesses that a mutual information profile correctly bounds the achievable (memory, surprisal) region for a language process.

                        The Achievable predicate abstracts over "the pair (H_M, S_M) can be realized by some memory encoding of the underlying process." The bound states: any achievable pair with memory at most memoryCost(T) must have surprisal at least S_∞ + surplusSurprisal(T). Geometrically, all achievable pairs lie above the bound curve parameterized by T.

                        • The mutual information profile derived from the process

                        • entropyRate1000 : ℕ

                          Entropy rate of the process (S_∞ × 1000)

                        • Achievable : Prop

                          Predicate: (H_M, S_M) is achievable by some memory encoding

                        • bound (T H_M S_M : ℕ) : self.Achievable H_M S_M → H_M ≤ self.profile.memoryCost T → S_M ≥ self.entropyRate1000 + self.profile.surplusSurprisal T

                          The bound: any achievable pair lies above the trade-off curve. If memory ≤ memoryCost(T), then surprisal ≥ S_∞ + surplusSurprisal(T). Derived from DPI (bounding surprisal from below) and chain rule (decomposing memory cost into per-distance contributions).

                        Instances For

                          Trade-off Curve #

                          The memory-surprisal trade-off curve plots (H_M, S_M) pairs achievable by different memory encodings. Languages with more efficient word orders have curves closer to the origin (lower AUC).

                          Values are stored as Nat × 1000 for decidable computation.

                          A single point on the memory-surprisal trade-off curve.

                          memoryBits1000 = H_M × 1000 (memory in millibits) surprisal1000 = S_M × 1000 (surprisal in millibits)

                          • memoryBits1000 : ℕ
                          • surprisal1000 : ℕ
                          Instances For
                            Equations
                            • One or more equations did not get rendered due to their size.
                            Instances For
                              Equations
                              • One or more equations did not get rendered due to their size.
                              Instances For

                                A memory-surprisal trade-off curve for a language or baseline.

                                • name : String
                                • points : List TradeoffPoint

                                  Points ordered by increasing memory (decreasing surprisal)

                                Instances For
                                  Equations
                                  • One or more equations did not get rendered due to their size.
                                  Instances For
                                    Equations
                                    • One or more equations did not get rendered due to their size.
                                    Instances For

                                      Approximate area under the trade-off curve via trapezoidal rule.

                                      AUC × 1000000 (millibits²). Lower AUC = more efficient trade-off. The curve should have points ordered by increasing memory.

                                      Equations
                                      • One or more equations did not get rendered due to their size.
                                      Instances For
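The trapezoidal approximation can be sketched like this (toy names; as in the library, values are stored × 1000 so areas come out in millibits²):

```lean
structure ToyPoint where
  mem : Nat    -- H_M × 1000
  surp : Nat   -- S_M × 1000

-- Sum of trapezoid areas between consecutive points,
-- assuming points are ordered by increasing memory.
def toyAUC : List ToyPoint → Nat
  | p :: q :: rest => (q.mem - p.mem) * (p.surp + q.surp) / 2 + toyAUC (q :: rest)
  | _ => 0

#eval toyAUC [⟨0, 600⟩, ⟨300, 300⟩, ⟨700, 100⟩]  -- 135000 + 80000 = 215000
```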

                                        The efficient trade-off hypothesis: a real language's AUC is smaller than its random baseline's AUC.

                                        Equations
                                        Instances For

                                          The trade-off bound curve implied by the mutual information profile (Theorem 1). Point T maps to (memoryCost(T), surplusSurprisal(T)) for T = 0, 1, ..., n.

                                          Equations
                                          • One or more equations did not get rendered due to their size.
                                          Instances For

                                            Bridge: Memory-Surprisal ↔ Rate-Distortion #

                                            The memory-surprisal trade-off is structurally analogous to rate-distortion theory:

                                             Memory-Surprisal     Rate-Distortion
                                             Memory H_M           Rate R
                                             Surprisal S_M        Distortion D
                                             Trade-off curve      RD curve
                                             Info locality        Structural constraint

                                            Both characterize optimal lossy compression: the memory encoding compresses the past (lossy), and the trade-off curve is the achievable region boundary.

                                            TODO: Formal proof requires showing that the memory-surprisal trade-off curve equals the rate-distortion function for the appropriate source and distortion measure. See @cite{hahn-degen-futrell-2021} SI §1.3.

                                            Bridge: Memory ↔ Processing Locality #

                                            The memory dimension H_M connects to the locality dimension of ProcessingModel.ProcessingProfile: higher memory means the processor can track longer-distance dependencies, which is equivalent to tolerating higher locality costs.

                                            Map a memory capacity (in bits × 1000) to a processing profile.

                                            Higher memory capacity → processor can handle longer dependencies → maps to locality (the maximum dependency distance the processor can comfortably handle).

                                            Equations
                                            Instances For

                                              More memory → can handle longer locality.

                                              Bridge: Information Locality ↔ Dependency Locality #

                                              Information locality (I_t concentrated at small t) generalizes dependency locality (short structural distances between related words). When syntactic dependencies are short, the words that carry predictive information about each other are close together, making I_t decay rapidly.

                                              The structural content of this bridge is MutualInfoProfile.weightedSum_le_of_isLocal (§3): for any profile that is information-local with bound L, the memory cost is bounded by L · totalInfo. DLM (which caps L structurally) is therefore the dependency-graph-level instantiation of information-locality optimization. See @cite{futrell-2019} and @cite{hahn-degen-futrell-2021} §2.3 for the broader empirical case.

                                              Universal length bound on memory cost (vacuous case of weightedSum_le_of_isLocal). For any profile, the weighted sum is bounded by length · totalInfo. The interesting case is when the profile is information-local with L < length, giving a tighter bound.

                                              Bridge: Memory-Surprisal ↔ Generalised Surprisal #

                                              The memory-surprisal trade-off operates at the standard surprisal configuration: negLog warping, indicator scoring, horizon 1, predictive level. The trade-off curve varies memory capacity while holding the prediction resolution fixed.

                                              @cite{giulianelli-etal-2026} generalizes this by also varying the resolution parameters (forecast horizon h and representational level l), showing that different psycholinguistic measures are best predicted at different resolutions. The memory-surprisal trade-off is the special case where prediction resolution is held constant and only the memory budget varies.

                                              The generalised surprisal configuration used by the memory-surprisal trade-off: standard surprisal (negLog × indicator × h=1 × predictive).

                                              The trade-off curve parametrizes over memory encodings while holding this resolution fixed. IAS extends this by also parametrizing over the prediction resolution (horizon and representational level).

                                              Equations
                                              Instances For

                                                Bridge to NoisyChannel: deterministic encoders are Dirac MemoryProcesses #

                                                The MemoryEncoding of @cite{hahn-degen-futrell-2021} is a deterministic context-summary (Mem × W) → Mem (plus an initial state). Paired with a predictor Mem → PMF (Option W), it induces a MemoryProcess (in Theories.Processing.NoisyChannel) whose encoder is a Dirac at the iterated memory state. The deterministic encoder is exactly the lossless special case that the @cite{futrell-gibson-levy-2020} MemoryProcess substrate generalizes — making the connection true by construction.

                                                noncomputable def Processing.MemorySurprisal.MemoryEncoding.toMemoryProcess {W Mem : Type} (me : MemoryEncoding W Mem) (predict : Mem → PMF (Option W)) :

                                                Pair a deterministic MemoryEncoding with a predictor to obtain a MemoryProcess. The encoder is the Dirac at the iterated context summary; the predictor is exposed unchanged.

                                                Equations
                                                Instances For
                                                  theorem Processing.MemorySurprisal.MemoryEncoding.toMemoryProcess_isDirac {W Mem : Type} (me : MemoryEncoding W Mem) (predict : Mem → PMF (Option W)) :

                                                  The induced MemoryProcess is Dirac at me.summary, by construction.

                                                  theorem Processing.MemorySurprisal.MemoryEncoding.toMemoryProcess_expectedSurprisal_eq_virtualLM_surprisal {W Mem : Type} (me : MemoryEncoding W Mem) (predict : Mem → PMF (Option W)) (c : List W) (w : W) :

                                                  Lossless reduction transported. A MemoryEncoding paired with a predictor recovers classical surprisal under its induced language model (the @cite{futrell-gibson-levy-2020} §3.5.1 reduction, applied here).