Linglib.Studies.FutrellGibsonLevy2020

Futrell, Gibson & Levy (2020): lossy-context surprisal

[FGL20] (Cognitive Science 44, e12814) unifies expectation-based and memory-based theories of processing difficulty: the difficulty of a word is its expected surprisal given a lossy memory representation of the context (Claims 1–4, eq. (3)). Processing.Memory.Channel formalizes the architecture (MemoryProcess, with expectedSurprisal as eq. (3)), and Processing.Memory.LossyContext formalizes its lossless regime (§3.5.1).

This file proves the paper's §5 result, information locality, in the single-dependency configuration, where the paper's first-order eq. (11) (Supplementary Material C) holds exactly. Under erasure noise with erasure rate e, expected surprisal is h(w) − (1 − e) · pmi(w; y), where h(w) is the unconditional surprisal of w; the excess difficulty over plain surprisal is exactly e · pmi(w; y) (eq. (12)); and the sign of the distance effect is the sign of the pmi: locality for positively associated words, anti-locality for negatively associated ones.

Structural forgetting (§4) is a parameter-space simulation (Figs. 3–4: forgetting occurs iff the verb-final relative-clause rate f is low, as for English with f ≈ 0.2 but not German with f = 1) and stays in prose.

Main definitions

pmi — pointwise mutual information of the next word with a one-word context (§5.1.2), relative to the model's own empty-context prior.
erasure — the erasure-noise memory process (§5.1.3): the context's head word survives with probability 1 − e; an erased word reads as the empty context.
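
The sign of the pmi decides which way the distance effect goes, so it may help to see when the log-ratio is nonnegative. The following is a minimal Lean sketch, assuming Mathlib, stated over plain real numbers rather than the library's LangModel: p0 and py are hypothetical stand-ins for the empty-context and one-word-context probabilities of the next word, not Linglib API. It checks that the log-ratio is nonnegative whenever the context word does not lower the probability.

```lean
import Mathlib

-- Hypothetical stand-ins (not Linglib API):
--   p0 = probability of w in the empty context,
--   py = probability of w after the context word y.
-- If the context does not lower the probability, the log-ratio is nonnegative.
example (p0 py : ℝ) (hp0 : 0 < p0) (h : p0 ≤ py) :
    0 ≤ Real.log (py / p0) :=
  Real.log_nonneg ((one_le_div hp0).mpr h)
```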

Main results

surprisal_eq_sub_pmi — eq. (10): conditional surprisal is unconditional surprisal minus pmi.
expectedSurprisal_erasure — eq. (11), exact single-dependency form: D_lc = h(w) − (1 − e) · pmi.
expectedSurprisal_erasure_sub_surprisal — eq. (12): the excess difficulty over plain surprisal is e · pmi.
locality, antilocality — §5.1.4: under progressive noise (more distant ⇒ larger e), difficulty is monotone in the erasure rate, increasing when 0 ≤ pmi and decreasing when pmi ≤ 0.
erasure_zero, erasure_one — the brackets: no erasure recovers plain surprisal (§3.5.1); certain erasure recovers the prior (§3.4.2).
mutualInformation_memory_le — §3.2: the data processing inequality as the constraint on all admissible noise distributions.
bayesDifficulty_memJoint_sub_eq, bayesDifficulty_le_memJoint — the average form (§3.4.1, Supp. A, at the §3.3 Bayes-optimal comprehender): expected difficulty under lossy memory exceeds expected difficulty under veridical context by exactly the predictive information lost to memory, I(W;C) − I(W;M) ≥ 0.

noncomputable def FutrellGibsonLevy2020.pmi {Voc : Type u_1} (L : Processing.LanguageModel.LangModel Voc) (y w : Voc) :

ℝ

Pointwise mutual information of the next word w with the one-word context y (§5.1.2), relative to the model's empty-context prior.

Equations

FutrellGibsonLevy2020.pmi L y w = Real.log ((L.nextProb [y] w).toReal / (L.nextProb [] w).toReal)

theorem FutrellGibsonLevy2020.surprisal_eq_sub_pmi {Voc : Type u_1} (L : Processing.LanguageModel.LangModel Voc) {y w : Voc} (h0 : L.nextProb [] w ≠ 0) (hy : L.nextProb [y] w ≠ 0) :

L.surprisal [y] w = L.surprisal [] w - pmi L y w

Eq. (10): conditional surprisal decomposes as unconditional surprisal minus pointwise mutual information.
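
The arithmetic behind this decomposition is a single logarithm identity. Here is a sketch in Lean, assuming Mathlib and assuming that surprisal is the negative natural log of the probability; p0 and py are hypothetical real-valued stand-ins for the two probabilities, not the library's definitions.

```lean
import Mathlib

-- p0, py: hypothetical real-valued stand-ins for the empty-context and
-- one-word-context next-word probabilities (not Linglib API).
-- Assumed reading: surprisal is -log p, and pmi is log (py / p0).
example (p0 py : ℝ) (h0 : p0 ≠ 0) (hy : py ≠ 0) :
    -Real.log py = -Real.log p0 - Real.log (py / p0) := by
  rw [Real.log_div hy h0]  -- log (py / p0) = log py - log p0
  ring
```

The two nonvanishing hypotheses mirror h0 and hy in the theorem: without them the logarithm of the ratio does not split.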

noncomputable def FutrellGibsonLevy2020.erasure {Voc : Type u_1} (L : Processing.LanguageModel.LangModel Voc) [DecidableEq Voc] (e : NNReal) (he : e ≤ 1) :

Processing.NoisyChannel.MemoryProcess Voc (Option Voc)

The erasure-noise memory process (§5.1.3): the memory retains the context's head word with probability 1 − e and erases it with probability e; the predictor reads a retained word as a one-word context and an erased one as the empty context.

Equations

One or more equations did not get rendered due to their size.

theorem FutrellGibsonLevy2020.erasure_encode_apply {Voc : Type u_1} (L : Processing.LanguageModel.LangModel Voc) [DecidableEq Voc] {e : NNReal} {y : Voc} (he : e ≤ 1) (m : Option Voc) :

((erasure L e he).encode [y]) m = if m = none then ↑e else if m = some y then 1 - ↑e else 0
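
Read as a distribution over Option Voc, the encoder puts weight e on none, weight 1 − e on some y, and nothing elsewhere. A small Lean check, assuming Mathlib and using a real number e as a hypothetical stand-in for the coerced erasure rate, confirms that these two weights are nonnegative and sum to one when 0 ≤ e ≤ 1.

```lean
import Mathlib

-- e is a hypothetical real-valued stand-in for the coerced erasure rate.
-- For 0 ≤ e ≤ 1 the two nonzero weights are nonnegative and sum to one.
example (e : ℝ) (h0 : 0 ≤ e) (h1 : e ≤ 1) :
    0 ≤ e ∧ 0 ≤ 1 - e ∧ e + (1 - e) = 1 :=
  ⟨h0, sub_nonneg.mpr h1, by ring⟩
```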

theorem FutrellGibsonLevy2020.expectedSurprisal_erasure {Voc : Type u_1} (L : Processing.LanguageModel.LangModel Voc) [DecidableEq Voc] {e : NNReal} {y w : Voc} (he : e ≤ 1) (h0 : L.nextProb [] w ≠ 0) (hy : L.nextProb [y] w ≠ 0) :

(erasure L e he).expectedSurprisal [y] w = L.surprisal [] w - (1 - ↑e) * pmi L y w

The exact single-dependency form of eq. (11): under erasure noise, the lossy-context difficulty is the unconditional surprisal minus the surviving fraction of the pmi.
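
One way to see why the identity is exact here: the expected surprisal is a two-point average, with weight 1 − e on the conditional surprisal h − pmi (by eq. (10)) and weight e on the unconditional surprisal h. The sketch below, assuming Mathlib, checks that this average collapses to the stated form; h and p are hypothetical real-valued stand-ins for the unconditional surprisal and the pmi, and the two-point reading is an assumption about how expectedSurprisal unfolds for this encoder.

```lean
import Mathlib

-- h: unconditional surprisal, p: pmi, e: erasure rate
-- (hypothetical real-valued stand-ins, not Linglib API).
-- Retained with weight 1 - e (surprisal h - p), erased with weight e (surprisal h).
example (h p e : ℝ) :
    (1 - e) * (h - p) + e * h = h - (1 - e) * p := by
  ring
```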

theorem FutrellGibsonLevy2020.expectedSurprisal_erasure_sub_surprisal {Voc : Type u_1} (L : Processing.LanguageModel.LangModel Voc) [DecidableEq Voc] {e : NNReal} {y w : Voc} (he : e ≤ 1) (h0 : L.nextProb [] w ≠ 0) (hy : L.nextProb [y] w ≠ 0) :

(erasure L e he).expectedSurprisal [y] w - L.surprisal [y] w = ↑e * pmi L y w

Eq. (12): the excess difficulty of erasure-noise processing over plain surprisal is exactly the erased fraction of the pmi.
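
Given eqs. (10) and (11), the excess is a one-line subtraction. A minimal Lean sketch, assuming Mathlib, with h and p as hypothetical real-valued stand-ins for the unconditional surprisal and the pmi:

```lean
import Mathlib

-- Lossy-context difficulty (eq. (11) form) minus plain conditional surprisal
-- (eq. (10) form) leaves exactly the erased fraction of the pmi.
-- h, p, e are hypothetical real-valued stand-ins, not Linglib API.
example (h p e : ℝ) :
    (h - (1 - e) * p) - (h - p) = e * p := by
  ring
```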

theorem FutrellGibsonLevy2020.locality {Voc : Type u_1} (L : Processing.LanguageModel.LangModel Voc) [DecidableEq Voc] {e e' : NNReal} {y w : Voc} (h : e ≤ e') (he' : e' ≤ 1) (hpmi : 0 ≤ pmi L y w) (h0 : L.nextProb [] w ≠ 0) (hy : L.nextProb [y] w ≠ 0) :

(erasure L e ⋯).expectedSurprisal [y] w ≤ (erasure L e' he').expectedSurprisal [y] w

Information locality (§5.1.4): under progressive noise — a more distant context word has a larger erasure rate — difficulty increases with distance whenever the words are positively associated.

theorem FutrellGibsonLevy2020.antilocality {Voc : Type u_1} (L : Processing.LanguageModel.LangModel Voc) [DecidableEq Voc] {e e' : NNReal} {y w : Voc} (h : e ≤ e') (he' : e' ≤ 1) (hpmi : pmi L y w ≤ 0) (h0 : L.nextProb [] w ≠ 0) (hy : L.nextProb [y] w ≠ 0) :

(erasure L e' he').expectedSurprisal [y] w ≤ (erasure L e ⋯).expectedSurprisal [y] w

Anti-locality (§5.1.4, cf. the Konieczny effects of §2): when the words are negatively associated, losing the context word lowers difficulty, so difficulty decreases with distance.
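
Both directions reduce to the sign of (e' − e) · pmi once difficulty is written in the eq. (11) form. The following Lean sketch, assuming Mathlib, shows this over plain reals; h and p are hypothetical stand-ins for the unconditional surprisal and the pmi, and the bound e' ≤ 1 is not needed at this level of abstraction.

```lean
import Mathlib

-- h, p: hypothetical stand-ins for unconditional surprisal and pmi.
-- Locality: with nonnegative pmi, a larger erasure rate gives no less difficulty.
example (h p e e' : ℝ) (hee : e ≤ e') (hp : 0 ≤ p) :
    h - (1 - e) * p ≤ h - (1 - e') * p := by
  nlinarith [mul_nonneg (sub_nonneg.mpr hee) hp]

-- Anti-locality: with nonpositive pmi the inequality reverses.
example (h p e e' : ℝ) (hee : e ≤ e') (hp : p ≤ 0) :
    h - (1 - e') * p ≤ h - (1 - e) * p := by
  nlinarith [mul_nonneg (sub_nonneg.mpr hee) (neg_nonneg.mpr hp)]
```

When the pmi is zero both inequalities hold with equality, which is why the theorems are stated with ≤ rather than <.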

theorem FutrellGibsonLevy2020.erasure_zero {Voc : Type u_1} (L : Processing.LanguageModel.LangModel Voc) [DecidableEq Voc] {y w : Voc} (h0 : L.nextProb [] w ≠ 0) (hy : L.nextProb [y] w ≠ 0) :

(erasure L 0 ⋯).expectedSurprisal [y] w = L.surprisal [y] w

No erasure recovers plain surprisal (§3.5.1's special case, at the toy configuration; the general statement is Processing.NoisyChannel.expectedSurprisal_eq_surprisal_of_lossless).

theorem FutrellGibsonLevy2020.erasure_one {Voc : Type u_1} (L : Processing.LanguageModel.LangModel Voc) [DecidableEq Voc] {y w : Voc} (h0 : L.nextProb [] w ≠ 0) (hy : L.nextProb [y] w ≠ 0) :

(erasure L 1 ⋯).expectedSurprisal [y] w = L.surprisal [] w

Certain erasure recovers the prior (§3.4.2: "regression to prior expectations"; the general statement is Processing.NoisyChannel.MemoryProcess.expectedSurprisal_of_constantEncoder).
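
The two endpoints can be read off the eq. (11) form by substituting e = 0 and e = 1. A minimal Lean sketch, assuming Mathlib, with h and p as hypothetical real-valued stand-ins for the unconditional surprisal and the pmi:

```lean
import Mathlib

-- e = 0: the full pmi survives, leaving the conditional surprisal h - p.
example (h p : ℝ) : h - (1 - 0) * p = h - p := by ring

-- e = 1: none of the pmi survives, leaving the unconditional surprisal h.
example (h p : ℝ) : h - (1 - 1) * p = h := by ring
```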

theorem FutrellGibsonLevy2020.mutualInformation_memory_le {W : Type u_2} {C : Type u_3} {M : Type u_4} [Fintype W] [Fintype C] [Fintype M] [MeasurableSpace W] [MeasurableSpace C] [MeasurableSpace M] [MeasurableSingletonClass W] [MeasurableSingletonClass C] [MeasurableSingletonClass M] [DecidableEq W] [DecidableEq C] (joint : PMF (W × C)) (mem : C → PMF M) :

(joint.bind fun (x : W × C) => PMF.map (Prod.mk x.1) (mem x.2)).mutualInformation ≤ joint.mutualInformation

§3.2's constraint on noise distributions, as the mutual-information form of the data processing inequality: a memory representation generated from the context (Claim 3) carries no more information about the next word than the context itself, whatever the noise distribution.

The average form (§3.4.1, Supplementary Material A)

Averaged over contexts, the difficulty of the Bayes-optimal comprehender (§3.3, eqs. (4)–(9)) under lossy memory exceeds its difficulty under veridical context by exactly the predictive information lost to memory.

noncomputable def FutrellGibsonLevy2020.memJoint {W : Type u_2} {C : Type u_3} {M : Type u_4} (J : PMF (W × C)) (mem : C → PMF M) :

PMF (W × M)

The (word, memory) joint induced by passing the context coordinate through the memory encoder (Claims 1 and 3).

Equations

FutrellGibsonLevy2020.memJoint J mem = J.bind fun (x : W × C) => PMF.map (Prod.mk x.1) (mem x.2)

noncomputable def FutrellGibsonLevy2020.bayesDifficulty {α : Type u_5} {β : Type u_6} [Fintype α] [Fintype β] [DecidableEq β] (G : PMF (α × β)) :

ℝ

Expected difficulty of the Bayes-optimal comprehender (§3.3): the expected surprisal of predicting the first coordinate from the second under the conditional distribution PMF.cond.

Equations

FutrellGibsonLevy2020.bayesDifficulty G = ∑ x : α × β, (G x).toReal * -Real.log ((G.cond x.2) x.1).toReal

theorem FutrellGibsonLevy2020.bayesDifficulty_eq {α : Type u_5} {β : Type u_6} [Fintype α] [Fintype β] [DecidableEq β] (G : PMF (α × β)) :

bayesDifficulty G = G.entropy - G.snd.entropy

The Bayes-optimal difficulty is the conditional entropy H(W | ·): the entropy chain rule read as an expected surprisal.
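
To connect this conditional-entropy form with mutual information, one can use the standard identity I = H(A) + H(B) − H(A, B). The Lean sketch below, assuming Mathlib, treats the entropies as hypothetical real-valued stand-ins (HA, HB, HAB, I are illustrative names, not the library's definitions) and shows that joint entropy minus the entropy of the second coordinate equals H(A) − I.

```lean
import Mathlib

-- HA, HB: marginal entropies; HAB: joint entropy; I: mutual information.
-- All four are hypothetical real-valued stand-ins, tied together only by
-- the assumed standard identity hI.
example (HA HB HAB I : ℝ) (hI : I = HA + HB - HAB) :
    HAB - HB = HA - I := by
  linarith
```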

theorem FutrellGibsonLevy2020.bayesDifficulty_memJoint_sub_eq {W : Type u_2} {C : Type u_3} {M : Type u_4} [Fintype W] [Fintype C] [Fintype M] [MeasurableSpace W] [MeasurableSpace C] [MeasurableSpace M] [MeasurableSingletonClass W] [MeasurableSingletonClass C] [MeasurableSingletonClass M] [DecidableEq W] [DecidableEq C] [DecidableEq M] (J : PMF (W × C)) (mem : C → PMF M) (hW : ∀ (w : W), 0 < J.fst.toRealFn w) (hC : ∀ (c : C), 0 < J.snd.toRealFn c) (hM : ∀ (m : M), 0 < (memJoint J mem).snd.toRealFn m) :

bayesDifficulty (memJoint J mem) - bayesDifficulty J = J.mutualInformation - (memJoint J mem).mutualInformation

The average form of information locality: the expected excess difficulty of lossy-memory comprehension over veridical-context comprehension is exactly the predictive information lost to memory.

theorem FutrellGibsonLevy2020.bayesDifficulty_le_memJoint {W : Type u_2} {C : Type u_3} {M : Type u_4} [Fintype W] [Fintype C] [Fintype M] [MeasurableSpace W] [MeasurableSpace C] [MeasurableSpace M] [MeasurableSingletonClass W] [MeasurableSingletonClass C] [MeasurableSingletonClass M] [DecidableEq W] [DecidableEq C] [DecidableEq M] (J : PMF (W × C)) (mem : C → PMF M) (hW : ∀ (w : W), 0 < J.fst.toRealFn w) (hC : ∀ (c : C), 0 < J.snd.toRealFn c) (hM : ∀ (m : M), 0 < (memJoint J mem).snd.toRealFn m) :

bayesDifficulty J ≤ bayesDifficulty (memJoint J mem)

Lossy memory cannot make comprehension easier on average: the expected Bayes-optimal difficulty under memory is at least that under veridical context (the §3.4.1 deduction, with the gap given by bayesDifficulty_memJoint_sub_eq and its sign by the data processing inequality).
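
At the level of real numbers, the average form is a short calculation once each difficulty is written as H(W) minus a mutual information. The following Lean sketch, assuming Mathlib, uses hypothetical stand-ins HW, IC and IM for H(W), I(W;C) and I(W;M); the hypothesis hDPI plays the role of the data processing inequality and is assumed here, not derived.

```lean
import Mathlib

-- Difficulty under memory is HW - IM; under veridical context it is HW - IC.
-- Their gap is exactly the predictive information lost to memory.
example (HW IC IM : ℝ) : (HW - IM) - (HW - IC) = IC - IM := by
  ring

-- With the data processing inequality IM ≤ IC assumed, the gap is nonnegative.
example (HW IC IM : ℝ) (hDPI : IM ≤ IC) : HW - IC ≤ HW - IM := by
  linarith
```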