Data processing inequality for PMF.klDiv (finite-case) #
The data processing inequality (DPI) states that applying a (Markov) kernel to two distributions can only decrease their KL divergence:
klDiv (p.bind κ) (q.bind κ) ≤ klDiv p q
This file proves DPI for PMFs over Fintypes, plus the underlying log-sum
inequality from which DPI follows.
Main results #
Real.klFun_logSum_le: log-sum inequality —Y * klFun(X / Y) ≤ ∑ b_i * klFun(a_i / b_i)for non-negative reals.PMF.klDiv_bind_le: DPI for bind on finite types.PMF.klDiv_map_le: DPI for map (deterministic kernel) on finite types.
Application #
DPI is the structural foundation for "informativity" claims in RSA-style
models: as an observation kernel becomes noisier, the listener's posterior
moves closer to the prior. Specifically, for the [GS13]
cancellation theorem, DPI says that as speaker access decreases (kernel
becomes less informative), KL(L1(a, u) ‖ prior) decreases — the
listener gains less information from the utterance.
Log-sum inequality (klFun version): for non-negative reals
x_i, y_i with y_i > 0 summing to Y and x_i summing to X,
Y * klFun(X / Y) ≤ ∑ y_i * klFun(x_i / y_i).
A direct corollary of Jensen's inequality applied to the convex function
klFun on [0, ∞), with weights w_i = y_i / Y.
This is the foundation for the data processing inequality on KL divergence: when we marginalize a finite-supported distribution, the KL of marginals is at most the KL of joints.
Data Processing Inequality for PMF.klDiv (finite case):
applying a Markov kernel κ : α → PMF β to two PMFs cannot increase
their KL divergence:
klDiv (p.bind κ) (q.bind κ) ≤ klDiv p q.
The information-theoretic content: post-processing observations through
any noisy channel can only destroy information about the source. The only
hypothesis is absolute continuity p ≪ q; absolute continuity of the
binds is derived.
Data processing for mutual information #
Garbling one coordinate of a joint distribution through a channel cannot increase mutual information. This is the form of the DPI that constrains memory models: a representation generated from a context can carry no more information about another variable than the context itself ([FGL20] §3.2).
A joint distribution is absolutely continuous with respect to the product of its marginals.
Data processing inequality, mutual-information form: passing one coordinate of a joint distribution through a channel cannot increase mutual information.