Data processing inequality for PMF.klDiv (finite case) #
The data processing inequality (DPI) states that applying a (Markov) kernel to two distributions cannot increase their KL divergence:
klDiv (p.bind κ) (q.bind κ) ≤ klDiv p q
This file proves DPI for PMFs over Fintypes, plus the underlying log-sum
inequality from which DPI follows.
Main results #
- Real.klFun_logSum_le: log-sum inequality, Y * klFun (X / Y) ≤ ∑ y_i * klFun (x_i / y_i) for non-negative reals.
- PMF.klDiv_bind_le: DPI for bind on finite types.
- PMF.klDiv_map_le: DPI for map (deterministic kernel) on finite types.
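For orientation, a minimal Lean sketch of the two DPI statement shapes. PMF.klDiv is this file's definition, and the binder names, Fintype assumptions, and the pointwise form of the absolute-continuity hypothesis below are assumptions for illustration, not the literal signatures:

```lean
import Mathlib

-- `PMF.klDiv` is assumed to be in scope from this file; the hypotheses and
-- binder names below are guesses at the statement shape, with proofs elided.
variable {α β : Type*} [Fintype α] [Fintype β]

-- Shape of PMF.klDiv_bind_le:
example (p q : PMF α) (κ : α → PMF β)
    (hpq : ∀ a, q a = 0 → p a = 0) :
    PMF.klDiv (p.bind κ) (q.bind κ) ≤ PMF.klDiv p q := sorry

-- Shape of PMF.klDiv_map_le (map is the deterministic-kernel special case):
example (p q : PMF α) (f : α → β)
    (hpq : ∀ a, q a = 0 → p a = 0) :
    PMF.klDiv (p.map f) (q.map f) ≤ PMF.klDiv p q := sorry
```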
Application #
DPI is the structural foundation for "informativity" claims in RSA-style
models: as an observation kernel becomes noisier, the listener's posterior
moves closer to the prior. Specifically, for the @cite{goodman-stuhlmuller-2013}
cancellation theorem, DPI says that as speaker access decreases (kernel
becomes less informative), KL(L1(a, u) ‖ prior) decreases — the
listener gains less information from the utterance.
Log-sum inequality (klFun version): for non-negative reals
x_i and positive reals y_i, with the y_i summing to Y and the x_i summing to X,
Y * klFun(X / Y) ≤ ∑ y_i * klFun(x_i / y_i).
A direct corollary of Jensen's inequality applied to the convex function
klFun on [0, ∞) with weights w_i = y_i / Y: convexity gives
klFun (∑ w_i * (x_i / y_i)) ≤ ∑ w_i * klFun (x_i / y_i), the weighted average
∑ w_i * (x_i / y_i) equals X / Y, and multiplying through by Y yields the claim.
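For concreteness, a hedged Lean sketch of the statement shape, written with Mathlib's Real.klFun (x ↦ x * log x + 1 - x); the binder names and exact hypothesis bundling are assumptions, not the lemma's literal signature:

```lean
import Mathlib

-- Log-sum inequality in klFun form; hypothesis names are assumed here.
example {ι : Type*} (s : Finset ι) (x y : ι → ℝ)
    (hx : ∀ i ∈ s, 0 ≤ x i) (hy : ∀ i ∈ s, 0 < y i) :
    (∑ i ∈ s, y i) * Real.klFun ((∑ i ∈ s, x i) / (∑ i ∈ s, y i)) ≤
      ∑ i ∈ s, y i * Real.klFun (x i / y i) := sorry
```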
This is the foundation for the data processing inequality on KL divergence: when we marginalize finitely supported joint distributions, the KL divergence of the marginals is at most that of the joints.
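Concretely (a proof-sketch step, not this file's literal lemma chain): for each b with (q.bind κ) b ≠ 0, instantiate the log-sum inequality with x_a = p a * κ a b and y_a = q a * κ a b, so that X = (p.bind κ) b and Y = (q.bind κ) b. Since (p a * κ a b) / (q a * κ a b) = p a / q a whenever q a * κ a b ≠ 0, this yields
(q.bind κ) b * klFun ((p.bind κ) b / (q.bind κ) b) ≤ ∑ a, q a * κ a b * klFun (p a / q a),
and summing over b with ∑ b, κ a b = 1 collapses the right-hand side to ∑ a, q a * klFun (p a / q a), which is klDiv p q in f-divergence form.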
Data Processing Inequality for PMF.klDiv (finite case):
applying a Markov kernel κ : α → PMF β to two PMFs cannot increase
their KL divergence:
klDiv (p.bind κ) (q.bind κ) ≤ klDiv p q.
The information-theoretic content: post-processing observations through any noisy channel can only destroy information about the source.
Hypotheses: p ≪ q (otherwise klDiv p q = ∞ and the inequality holds
trivially), and q a ≠ 0 → κ a b ≠ 0 for all a, b where the inequality at b
matters (encoded via the (q.bind κ) b ≠ 0 → ... guard). For finitely
supported discrete measures, both hypotheses reduce to checks at finitely
many (a, b) pairs.
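As a sanity check at the fully noisy extreme, a minimal Lean sketch: a constant kernel discards its input, so both pushforwards coincide and the left-hand side of the DPI collapses to klDiv r r = 0. The rewrite uses Mathlib's PMF.bind_const; the klDiv r r = 0 step depends on this file's API and is assumed rather than shown:

```lean
import Mathlib

-- A fully noisy channel: the constant kernel ignores its input, so both
-- pushforwards equal r, and DPI's left-hand side degenerates to klDiv r r.
example {α β : Type*} (p : PMF α) (r : PMF β) :
    (p.bind fun _ => r) = r :=
  PMF.bind_const p r
```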
This is the core principle that drives:
- the cancellation theorem in @cite{goodman-stuhlmuller-2013} (less informative speaker access → smaller KL of L1 from prior),
- post-processing bounds for any kernel-decomposed RSA model.