Memory Processes (Noisy-Channel Substrate, Memory-Side) #
@cite{futrell-gibson-levy-2020}
A MemoryProcess is the memory-side noisy-channel substrate of
@cite{futrell-gibson-levy-2020}'s lossy-context surprisal: a processor that
compresses its observed history into a (possibly noisy) memory state, then
predicts the next symbol from that memory state alone. This is the
context-noise specialization of the broader noisy-channel comprehension
posture (cf. Gibson-Bergen-Piantadosi 2013 for input-string noise; Levy
2011 for noise on both the input and the memory of context).
Following the paper's claims (§3.1, Claims 1–4):
- Incrementality of memory (Claim 1) — the encoder
encode : List Voc → PMF Memmaps each observed history to a distribution over memory states. (FGL state Claim 1 in terms of a recursive stepm : R × W → R; the bundled iterated form here discharges incrementality silently and is a known schema simplification.) - Linguistic knowledge (Claim 2) — the predictor
predict : Mem → PMF (Option Voc)maps each memory state to a next-symbol distribution (withnone= end-of-string). - Inaccessibility of context (Claim 3) — the predictor sees only the memory state, never the underlying history. This is what makes the model lossy.
- Linking hypothesis (Claim 4) — processing difficulty for the next
symbol is
expectedSurprisal(Eq. 3 of the paper), the expectation of-log P(w | m)overm ~ encode(c).
This file provides the abstract type and the master cost function. Specialisations (lossless = classical surprisal, n-gram, erasure noise, cue-based retrieval) live in sibling files.
The encoder and predictor are mathlib PMF kernels (countable support,
ℝ≥0∞-valued), so neither Voc nor Mem need to be Fintype. Expected
sums are taken as tsum (∑') which collapses to Finset.sum whenever a
Fintype instance is in scope.
Main definitions #
MemoryProcess Voc Mem— pair of(encode, predict)kernelsMemoryProcess.perStateSurprisal—-log P(w | m), the inner factor of Eq. 2/Eq. 3MemoryProcess.expectedSurprisal— Eq. 3:E_{m ~ encode(c)}[-log P(w | m)]MemoryProcess.marginalProb— forward-marginalP(w | c) = Σ_m P(m|c)·P(w|m)under the joint kernel(encode, predict)(the paper's noisy-channel inversion in §3.3 / Eqs. 4–9 goes the other direction, summing over contexts; the forward marginal here is implicit in the setup but not separately numbered)MemoryProcess.IsDirac— encoder is a point mass atf c(lossless)
Main theorem #
expectedSurprisal_of_dirac— when the encoder is a Dirac atf c, expected surprisal collapses to per-state surprisal atf c. This is the structural identity that turns the lossless case into classical surprisal inLossyContext.lean.
A processing architecture with a memory bottleneck.
encode lossily compresses the observed history into a distribution over
memory states; predict gives the next-symbol distribution from a memory
state alone. The memory state mediates all information flow from the
past to the prediction (the inaccessibility-of-context claim of
@cite{futrell-gibson-levy-2020} §3.1, Claim 3).
- encode : List Voc → PMF Mem
Lossy encoder: history → distribution over memory states.
- predict : Mem → PMF (Option Voc)
Predictor: memory state → next-symbol distribution.
Instances For
Per-state surprisal: D_lc(w | m) = -log P(w | m).
This is the inner factor that appears inside the paper's Eq. 2 (linking hypothesis as proportionality) and Eq. 3 (expected-surprisal expectation). @cite{futrell-gibson-levy-2020} does not give it a standalone equation number, but factoring it out is convenient for the Dirac-collapse proof below.
Equations
- mp.perStateSurprisal m w = -Real.log ((mp.predict m) (some w)).toReal
Instances For
Expected surprisal under the lossy memory process:
D_lc(w | c) = E_{m ~ encode(c)}[-log P(w | m)].
(@cite{futrell-gibson-levy-2020} Eq. 3.)
This is the master cost function from which classical surprisal,
n-gram surprisal, and the cue-based retrieval cost all derive as
instantiations of (encode, predict).
Equations
- mp.expectedSurprisal c w = ∑' (m : Mem), ((mp.encode c) m).toReal * mp.perStateSurprisal m w
Instances For
Forward-marginal next-symbol probability under the memory process:
P(w | c) = Σ_m P(m | c) · P(w | m).
Computed as (encode c).bind predict evaluated at some w — the
mathlib PMF monad bind is exactly the marginal of the joint kernel.
This forward marginal is implicit in @cite{futrell-gibson-levy-2020}'s
setup but not separately numbered. The paper's noisy-channel inversion
in §3.3 (Eqs. 4–9) goes the other direction: marginalizing over
contexts c to compute p(w | r) from a fixed memory state r. The
forward kernel below — marginalizing over memory states m given a
context c — is the dual operation.
Note: -log marginalProb (the surprisal of the marginal) is not in
general equal to expectedSurprisal (the expected surprisal). The two
agree iff the encoder is a Dirac (no information loss); otherwise Jensen
gives expectedSurprisal ≥ -log marginalProb.
Equations
- mp.marginalProb c w = ((mp.encode c).bind mp.predict) (some w)
Instances For
The memory process is Dirac at f if the encoder concentrates
all mass on f c for every history c. This is the lossless
("perfect memory") regime: the memory state is a deterministic
function of the history.
(@cite{futrell-gibson-levy-2020} §3.5.1: classical surprisal arises exactly in this case.)
Instances For
Dirac collapse. When the encoder is a Dirac at f c, expected
surprisal reduces to per-state surprisal evaluated at f c. The
expectation has no spread to average over.
This is the structural identity that lets us recover classical
surprisal as the special case of expectedSurprisal with lossless
memory (see LossyContext.lean's
expectedSurprisal_eq_surprisal_of_lossless).
Constant-encoder limit (regression to prior). When the encoder
ignores its input and concentrates on a single memory state m₀, expected
surprisal collapses to per-state surprisal at m₀. The predictor's
distribution at m₀ is then the comprehender's prior (no context-derived
information) — so per-state surprisal here is the unigram log-prob.
This is the opposite extreme of expectedSurprisal_eq_surprisal_of_lossless:
where lossless memory recovers classical surprisal, fully lossy memory
(all context information discarded) recovers the unigram prior. Together
they bracket the lossy-context regime — @cite{futrell-gibson-levy-2020}
§3.4.2 ("when all information about a true context is lost, the difficulty
of comprehending a word is given exactly by its log prior unigram
probability").
Jensen bound: expected surprisal lower-bounds marginal surprisal.
The expected surprisal under the lossy memory process is always at least
the surprisal of the forward marginal (encode c).bind predict.
Equivalently: the per-state-then-expectation order yields a higher
processing cost than the marginal-then-surprisal order. The two agree iff
the encoder is a Dirac (no information loss); in general Jensen's
inequality applied to the concave function log gives the strict direction.
This is the substantive content of "lossy memory makes things harder": @cite{futrell-gibson-levy-2020} Suppl. Mat. A's memory distortion bound rests on this Jensen gap.
Hypotheses. [Fintype Mem] reduces tsum to Finset.sum so that
mathlib's ConcaveOn.le_map_sum applies directly. h_pos requires the
per-state predicted probability of w to be strictly positive at every
memory state — needed because mathlib's Real.log 0 = 0 convention
breaks the bound at zero entries: a memory state with positive encode
mass but zero predict mass for w contributes 0 to expected
surprisal under the convention but a positive contribution to the
marginal, so −log marginalProb could exceed expectedSurprisal.