Cancellation theorems for RSA models with noisy observation #
@cite{goodman-stuhlmuller-2013}'s cancellation principle (informal): as a speaker's observation kernel becomes noisier, the listener's posterior given an utterance moves closer to the prior.
This file states the structural information-theoretic content that follows from the data processing inequality (DPI), and honestly scopes what's universal vs paper-specific.
What IS structurally provable (Path B from session audit) #
- Observation-level cancellation (mutualInformation_state_obs_le): MI(state; obs_noisy) ≤ MI(state; obs_informative) when κ_noisy = κ_informative.bind noise. Direct corollary of PMF.klDiv_bind_le applied per-state with the noise kernel; the hypothesis shape is sketched below.
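For concreteness, a minimal Lean sketch of that hypothesis shape, assuming only Mathlib's PMF monad. noisyKernel is a hypothetical local name, not a definition in this file:

```lean
import Mathlib.Probability.ProbabilityMassFunction.Monad

variable {W Obs : Type*}

/-- Hypothetical helper (not in this file): a noisy kernel obtained by
post-processing an informative kernel through a noise channel. -/
def noisyKernel (κᵢ : W → PMF Obs) (noise : Obs → PMF Obs) : W → PMF Obs :=
  fun s => (κᵢ s).bind noise

-- The cancellation hypothesis κ_noisy s = (κ_informative s).bind noise
-- holds definitionally for this construction.
example (κᵢ : W → PMF Obs) (noise : Obs → PMF Obs) (s : W) :
    noisyKernel κᵢ noise s = (κᵢ s).bind noise := rfl
```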
What IS NOT structurally provable #
- Utterance-level cancellation (MI(state; utt_noisy) ≤ MI(state; utt_informative)): this is not a clean DPI corollary. The Markov chain state → Y_i → (U_i, U_n) → U_n gives MI(state; U_n) ≤ MI(state; Y_i), but there is no direct comparison between U_n and U_i, because U_i and U_n share Y_i as a common parent; there is no chain state → U_i → U_n. May or may not hold depending on the shape of S1g.
- Per-(world-pair) ordering (e.g. GS2013's "L1(s2 | u) > L1(s3 | u) weakens with noise"): this is per-paper, depends on lex shape, and requires numerical evaluation.
Architectural framing #
- Universal substrate (here): the obs-level MI cancellation — a real theorem any RSA paper with a noisy obs chain can use.
- Per-paper findings: numerical evaluations of model behavior, illustrations of how the structural inequality manifests for specific lex / kernel shapes.
- Anti-pattern: claiming a single "cancellation theorem" that yields all the per-cell numerical orderings as corollaries. (No such theorem; the per-paper orderings depend on more than just kernel informativity.)
§1. Observation-level mutual information #
Define MI(state; obs) = ∑_s prior(s) · KL(κ s ‖ marg). This is the standard
chain-rule decomposition (Cover-Thomas Eq 2.61): the average information about
the state contained in an observation, equivalent to KL(joint ‖ product).
We work with this form rather than PMF.mutualInformation directly because the
per-state decomposition is what makes the DPI applicable: each comparison of
KL(κ_n s ‖ marg_n) against KL(κ_i s ‖ marg_i) reduces to klDiv_bind_le applied
per-state with the kernel noise : Obs → PMF Obs. The state component never
enters the kernel, so the state-preservation support issue vanishes.
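Spelled out (a standard derivation, not specific to this file), writing p for prior, m for marg, and J(s, o) = p(s) · κ(s)(o):

$$
\mathrm{KL}(J \,\|\, p \times m)
  = \sum_{s,o} p(s)\,\kappa(s)(o)\,\log\frac{p(s)\,\kappa(s)(o)}{p(s)\,m(o)}
  = \sum_{s} p(s) \sum_{o} \kappa(s)(o)\,\log\frac{\kappa(s)(o)}{m(o)}
  = \sum_{s} p(s)\,\mathrm{KL}(\kappa\,s \,\|\, m).
$$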
The observation marginal marg_o(o) = ∑_s prior(s) · κ(s)(o).
Equations
- RSA.obsMarginal prior κ = prior.bind κ
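A sketch of the pointwise agreement between the bind form and the sum above, for finite W. The proof script is an assumption about how RSA.obsMarginal unfolds; PMF.bind_apply and tsum_fintype are Mathlib lemmas:

```lean
-- Sketch only: the bind form evaluates pointwise to ∑ s, prior s * κ s o.
example {W Obs : Type*} [Fintype W] (prior : PMF W) (κ : W → PMF Obs)
    (o : Obs) :
    RSA.obsMarginal prior κ o = ∑ s, prior s * κ s o := by
  simp [RSA.obsMarginal, PMF.bind_apply, tsum_fintype]
```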
Per-state-decomposed mutual information between state and observation:
MI(state; obs) = ∑_s prior(s) · KL(κ s ‖ marg).
This is the conditional-relative-entropy form (Cover-Thomas Eq 2.61). For any
joint J(s, o) = prior(s) · κ(s)(o) with marginal marg(o) = ∑_s J(s, o),
the standard MI KL(J ‖ prior × marg) equals this per-state weighted sum.
The decomposition is what makes the DPI argument tractable.
Equations
- RSA.mutualInfoStateObs prior κ = ∑ s : W, prior s * (κ s).klDiv (RSA.obsMarginal prior κ)
§2. Observation-level DPI cancellation #
Theorem: if κ_n s = (κ_i s).bind noise for all states s (the noisy
observation kernel is a post-processing of the informative one through some
noise channel), then MI(state; obs_n) ≤ MI(state; obs_i).
The proof applies PMF.klDiv_bind_le per-state. For each s, we have
κ_n s = (κ_i s).bind noise AND obsMarginal prior κ_n = (obsMarginal prior κ_i).bind noise
(by associativity of bind: binding noise inside the kernel agrees with binding
it after the outer bind). The per-state KL inequality then lifts to the
prior-weighted sum.
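The lifting step is elementary; a self-contained sketch, where ℝ≥0∞ as the codomain of PMF.klDiv is an assumption:

```lean
import Mathlib.Probability.ProbabilityMassFunction.Basic

open scoped ENNReal

-- Sketch of the lifting step: per-state inequalities lift to the
-- prior-weighted sum. `f s`, `g s` stand in for KL(κ_n s ‖ marg_n)
-- and KL(κ_i s ‖ marg_i).
example {W : Type*} [Fintype W] (prior : PMF W) (f g : W → ℝ≥0∞)
    (h : ∀ s, f s ≤ g s) :
    ∑ s, prior s * f s ≤ ∑ s, prior s * g s :=
  Finset.sum_le_sum fun s _ => mul_le_mul_left' (h s) (prior s)
```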
Observation marginal under noisy kernel = noise-bind of informative marginal. This is the fact that lets the per-state DPI lift to the marginal level.
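A sketch of that fact in Lean, assuming RSA.obsMarginal unfolds to prior.bind κ as in the equation above; PMF.bind_bind is Mathlib's monad associativity law, and the simp set is a guess:

```lean
import Mathlib.Probability.ProbabilityMassFunction.Monad

-- Sketch: the noisy marginal is the noise-bind of the informative marginal.
example {W Obs : Type*} (prior : PMF W) (κᵢ : W → PMF Obs)
    (noise : Obs → PMF Obs) :
    RSA.obsMarginal prior (fun s => (κᵢ s).bind noise)
      = (RSA.obsMarginal prior κᵢ).bind noise := by
  simp [RSA.obsMarginal, PMF.bind_bind]
```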
Observation-level cancellation theorem (DPI form).
If the noisy observation kernel is a post-processing of the informative kernel
through a noise channel (κ_n s = (κ_i s).bind noise), then the mutual
information between state and observation decreases.
Reduces to PMF.klDiv_bind_le applied per-state. The state component never
enters the noise kernel, so the technical support precondition on klDiv_bind_le
applies cleanly to noise : Obs → PMF Obs.
The hypothesis h_ac is per-state absolute continuity of κ_i s w.r.t.
obsMarginal prior κ_i; h_marg_pos ensures the obs marginal has full support;
h_noise_pos ensures noise has full support compatible with the marginals (the
standard DPI support precondition).
This is the structural information-theoretic content of "less informative observation kernel → less information about state in the observation".
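For orientation, a statement-level sketch. The hypothesis formulations below are guesses reconstructed from the docstring, not the file's actual signature; the proof is elided, and the primed name marks it as hypothetical:

```lean
-- Statement sketch only; the real theorem is mutualInformation_state_obs_le.
theorem mutualInformation_state_obs_le' {W Obs : Type*} [Fintype W]
    (prior : PMF W) (κᵢ : W → PMF Obs) (noise : Obs → PMF Obs)
    (h_ac : ∀ s o, RSA.obsMarginal prior κᵢ o = 0 → κᵢ s o = 0)
    (h_marg_pos : ∀ o, RSA.obsMarginal prior κᵢ o ≠ 0)
    (h_noise_pos : ∀ o o', noise o o' ≠ 0) :
    RSA.mutualInfoStateObs prior (fun s => (κᵢ s).bind noise)
      ≤ RSA.mutualInfoStateObs prior κᵢ := by
  sorry
```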
§3. Honest scoping note #
The utterance-level form (MI(state; utt_n) ≤ MI(state; utt_i)) does NOT follow
from §2 by composing with S1g : Obs → PMF U. The natural composition gives:
MI(state; utt_x) ≤ MI(state; obs_x) ≤ MI(state; obs_i) for x ∈ {n, i}
So both MI(state; utt_n) and MI(state; utt_i) are bounded by MI(state; obs_i)
— no direct comparison between them. The Markov chain state → Y_i → (U_i, U_n)
gives MI(state; U_n) ≤ MI(state; Y_i) but there's no chain state → U_i → U_n
because U_n depends on Y_i directly via noise(Y_i), not just on U_i.
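To make the structural point concrete, the two utterance kernels written out, with hypothetical names and S1g : Obs → PMF U as the speaker stage from the text:

```lean
import Mathlib.Probability.ProbabilityMassFunction.Monad

variable {W Obs U : Type*}

-- Both kernels factor through the informative observation Y_i = κᵢ s, but
-- neither is a post-processing (a `bind`) of the other, so the DPI yields
-- no comparison between them.
def uttInformative (κᵢ : W → PMF Obs) (S1g : Obs → PMF U) : W → PMF U :=
  fun s => (κᵢ s).bind S1g

def uttNoisy (κᵢ : W → PMF Obs) (noise : Obs → PMF Obs)
    (S1g : Obs → PMF U) : W → PMF U :=
  fun s => ((κᵢ s).bind noise).bind S1g
```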
GS2013's per-(world-pair) findings (e.g. "L1(s2 | u) > L1(s3 | u) weakens at
lower access") are NOT corollaries of any clean MI cancellation. They are
specific numerical evaluations of the model — see
Phenomena/ScalarImplicatures/Studies/GoodmanStuhlmuller2013PMF.lean for the
illustrative computations.
The proper architectural framing: §2 above is the universal structural theorem about RSA models with noisy observation chains. Per-paper findings are illustrations of how that structural fact manifests for specific (lex, kernel) shapes — not corollaries provable from §2 alone.