Futrell, Gibson & Levy (2020): lossy-context surprisal
[FGL20] (Cognitive Science 44, e12814) unifies expectation-
based and memory-based theories of processing difficulty: the difficulty of a
word is its expected surprisal given a lossy memory representation of the
context (Claims 1–4, eq. (3)). Processing.Memory.Channel formalizes the
architecture (MemoryProcess, expectedSurprisal = eq. (3)) and
Processing.Memory.LossyContext its lossless regime (§3.5.1); this file
proves the paper's §5 result — information locality — in the
single-dependency configuration, where the paper's first-order eq. (11)
(Supplementary Material C) holds exactly: under erasure noise, expected
surprisal is h(w) − (1 − e)·pmi(w; y), the excess difficulty over plain
surprisal is exactly e · pmi (eq. (12)), and the sign of the distance
effect is the sign of the pmi — locality for positively associated words,
anti-locality for negatively associated ones. Structural forgetting (§4) is a
parameter-space simulation (Figs. 3–4: forgetting occurs iff the verb-final
relative-clause rate f is low, as for English, f ≈ 0.2, but not German,
f = 1), so it is described only in prose and is not formalized here.
Main definitions
- pmi: pointwise mutual information of the next word with a one-word context (§5.1.2), relative to the model's own empty-context prior.
- erasure: the erasure-noise memory process (§5.1.3). The context's head word survives with probability 1 − e; an erased word reads as the empty context.
Main results
- surprisal_eq_sub_pmi: eq. (10). Conditional surprisal is unconditional surprisal minus pmi.
- expectedSurprisal_erasure: eq. (11), exact single-dependency form. D_lc = h(w) − (1 − e) · pmi.
- expectedSurprisal_erasure_sub_surprisal: eq. (12). The excess difficulty over plain surprisal is e · pmi.
- locality, antilocality: §5.1.4. Under progressive noise (more distant ⇒ larger e), difficulty is monotone in the erasure rate, increasing when 0 ≤ pmi and decreasing when pmi ≤ 0.
- erasure_zero, erasure_one: the brackets. No erasure recovers plain surprisal (§3.5.1); certain erasure recovers the prior (§3.4.2).
- mutualInformation_memory_le: §3.2. The data processing inequality as the constraint on all admissible noise distributions.
- bayesDifficulty_memJoint_sub_eq, bayesDifficulty_le_memJoint: the average form (§3.4.1, Supp. A, at the §3.3 Bayes-optimal comprehender). Expected difficulty under lossy memory exceeds expected difficulty under veridical context by exactly the predictive information lost to memory, I(W;C) − I(W;M) ≥ 0.
Pointwise mutual information of the next word w with the one-word
context y (§5.1.2), relative to the model's empty-context prior.
Equations
- FutrellGibsonLevy2020.pmi L y w = Real.log ((L.nextProb [y] w).toReal / (L.nextProb [] w).toReal)
Eq. (10): conditional surprisal decomposes as unconditional surprisal minus pointwise mutual information.
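The algebra behind eq. (10) can be replayed over bare real numbers. The following is only a sketch, not the file's proof of surprisal_eq_sub_pmi: p and q are hypothetical stand-ins for the conditional probability of w given y and for the empty-context prior, both assumed strictly positive, and the blanket `import Mathlib` is an assumption about the build environment.

```lean
import Mathlib

-- Assumption: `p` plays the role of p(w | y) and `q` the role of the
-- empty-context prior p(w); both are taken to be positive reals here.
-- Surprisal given y  =  prior surprisal  −  pmi, where pmi = log (p / q).
example (p q : ℝ) (hp : 0 < p) (hq : 0 < q) :
    -Real.log p = -Real.log q - Real.log (p / q) := by
  rw [Real.log_div hp.ne' hq.ne']
  ring
```

The positivity hypotheses matter: Mathlib's Real.log is defined to be 0 at 0, so the identity can fail when either probability vanishes.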
The erasure-noise memory process (§5.1.3): the memory retains the
context's head word with probability 1 − e and erases it with
probability e; the predictor reads a retained word as a one-word
context and an erased one as the empty context.
Equations
- One or more equations did not get rendered due to their size.
The exact single-dependency form of eq. (11): under erasure noise, the lossy-context difficulty is the unconditional surprisal minus the surviving fraction of the pmi.
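The exact form of eq. (11) is an expectation over two memory outcomes: with probability 1 − e the context word is retained and the comprehender pays the conditional surprisal h − pmi (eq. (10)); with probability e it is erased and the comprehender pays the prior surprisal h. A minimal sketch of that arithmetic, with h, pmi and e as free real variables rather than the file's definitions:

```lean
import Mathlib

-- Assumption: `h` is the unconditional surprisal h(w), `pmi` the pointwise
-- mutual information, `e` the erasure rate; none is tied to a language model.
-- Retained branch (weight 1 − e) costs h − pmi; erased branch (weight e) costs h.
example (h pmi e : ℝ) :
    (1 - e) * (h - pmi) + e * h = h - (1 - e) * pmi := by
  ring
```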
Eq. (12): the excess difficulty of erasure-noise processing over plain surprisal is exactly the erased fraction of the pmi.
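Eq. (12) is a one-line consequence of the closed form D_lc = h(w) − (1 − e) · pmi, taking plain surprisal to be the conditional surprisal h − pmi. The sketch below checks the subtraction over free real variables; the names are illustrative and are not the file's declarations.

```lean
import Mathlib

-- Assumption: `h - (1 - e) * pmi` is the lossy-context difficulty of eq. (11)
-- and `h - pmi` is plain (full-context) surprisal, as in eq. (10).
example (h pmi e : ℝ) :
    (h - (1 - e) * pmi) - (h - pmi) = e * pmi := by
  ring
```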
Information locality (§5.1.4): under progressive noise — a more distant context word has a larger erasure rate — difficulty increases with distance whenever the words are positively associated.
Anti-locality (§5.1.4, cf. the Konieczny effects of §2): when the words are negatively associated, losing the context word lowers difficulty, so difficulty decreases with distance.
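Both monotonicity statements reduce to the sign of (e₂ − e₁) · pmi. The following sketch states them over free reals, with e₁ ≤ e₂ standing for "nearer" versus "more distant" under progressive noise; it is an illustration of the argument, not the file's statements of locality and antilocality.

```lean
import Mathlib

-- Locality: with pmi ≥ 0, a larger erasure rate gives a larger difficulty.
example (h pmi e₁ e₂ : ℝ) (hpmi : 0 ≤ pmi) (he : e₁ ≤ e₂) :
    h - (1 - e₁) * pmi ≤ h - (1 - e₂) * pmi := by
  nlinarith [mul_nonneg (sub_nonneg.mpr he) hpmi]

-- Anti-locality: with pmi ≤ 0, a larger erasure rate gives a smaller difficulty.
example (h pmi e₁ e₂ : ℝ) (hpmi : pmi ≤ 0) (he : e₁ ≤ e₂) :
    h - (1 - e₂) * pmi ≤ h - (1 - e₁) * pmi := by
  nlinarith [mul_nonneg (sub_nonneg.mpr he) (neg_nonneg.mpr hpmi)]
```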
No erasure recovers plain surprisal (§3.5.1's special case, at the toy
configuration; the general statement is
Processing.NoisyChannel.expectedSurprisal_eq_surprisal_of_lossless).
Certain erasure recovers the prior (§3.4.2: "regression to prior
expectations"; the general statement is
Processing.NoisyChannel.MemoryProcess.expectedSurprisal_of_constantEncoder).
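At the level of the closed form h(w) − (1 − e) · pmi, the two bracketing cases are just the endpoints e = 0 and e = 1. A sketch over free reals, assuming that closed form rather than unfolding the erasure process itself:

```lean
import Mathlib

-- e = 0: nothing is erased, so the difficulty is the conditional surprisal h − pmi.
example (h pmi : ℝ) : h - (1 - 0) * pmi = h - pmi := by
  ring

-- e = 1: the context word is always erased, so the difficulty is the prior surprisal h.
example (h pmi : ℝ) : h - (1 - 1) * pmi = h := by
  ring
```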
§3.2's constraint on noise distributions, as the mutual-information form of the data processing inequality: a memory representation generated from the context (Claim 3) carries no more information about the next word than the context itself, whatever the noise distribution.
The average form (§3.4.1, Supplementary Material A)
Averaged over contexts, the difficulty of the Bayes-optimal comprehender (§3.3, eqs. (4)–(9)) under lossy memory exceeds its difficulty under veridical context by exactly the predictive information lost to memory.
The (word, memory) joint induced by passing the context coordinate through the memory encoder (Claims 1 and 3).
Equations
- FutrellGibsonLevy2020.memJoint J mem = J.bind fun (x : W × C) => PMF.map (Prod.mk x.1) (mem x.2)
Expected difficulty of the Bayes-optimal comprehender (§3.3): the expected
surprisal of predicting the first coordinate from the second under the
conditional distribution PMF.cond.
Equations
- FutrellGibsonLevy2020.bayesDifficulty G = ∑ x : α × β, (G x).toReal * -Real.log ((G.cond x.2) x.1).toReal
The Bayes-optimal difficulty is the conditional entropy H(W | ·):
the entropy chain rule read as an expected surprisal.
The average form of information locality: the expected excess difficulty of lossy-memory comprehension over veridical-context comprehension is exactly the predictive information lost to memory.
Lossy memory cannot make comprehension easier on average: the expected
Bayes-optimal difficulty under memory is at least that under veridical
context (the §3.4.1 deduction, with the gap given by
bayesDifficulty_memJoint_sub_eq and its sign by the data processing
inequality).
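The bookkeeping behind the average form can be sketched with the entropies treated as opaque real numbers. Here HW, HWC, HWM, IWC and IWM are hypothetical names for H(W), H(W | C), H(W | M), I(W;C) and I(W;M), and the hypotheses hC and hM encode the standard identity I(W;X) = H(W) − H(W | X). The sketch does not touch PMF, memJoint or bayesDifficulty.

```lean
import Mathlib

-- The gap in Bayes-optimal difficulty equals the predictive information lost.
example (HW HWC HWM IWC IWM : ℝ)
    (hC : IWC = HW - HWC) (hM : IWM = HW - HWM) :
    HWM - HWC = IWC - IWM := by
  linarith

-- Given the data processing inequality as a hypothesis, lossy memory
-- cannot lower the average difficulty.
example (HW HWC HWM IWC IWM : ℝ)
    (hC : IWC = HW - HWC) (hM : IWM = HW - HWM) (dpi : IWM ≤ IWC) :
    HWC ≤ HWM := by
  linarith
```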