Linglib.Core.Probability.DataProcessing

Data processing inequality for PMF.klDiv (finite-case) #

The data processing inequality (DPI) states that applying a (Markov) kernel to two distributions cannot increase their KL divergence:

klDiv (p.bind κ) (q.bind κ) ≤ klDiv p q

This file proves DPI for PMFs over Fintypes, plus the underlying log-sum inequality from which DPI follows.
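
For intuition, p.bind κ is p pushed through the channel κ. A minimal sketch of what that means pointwise, using Mathlib's PMF.bind_apply (the only assumption beyond this file is a Mathlib import):

  import Mathlib

  -- The channel output at b is the p-weighted mixture of the rows of κ.
  example {α β : Type} (p : PMF α) (κ : α → PMF β) (b : β) :
      (p.bind κ) b = ∑' a, p a * (κ a) b := by
    simp [PMF.bind_apply]

In this reading, the inequality compares the two channel outputs: pushing p and q through the same κ cannot move them further apart in KL divergence.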

Main results #

  • Real.klFun_logSum_le: the log-sum inequality, stated in terms of InformationTheory.klFun.
  • PMF.klDiv_bind_le: the data processing inequality for PMF.klDiv over Fintypes.

Application #

DPI is the structural foundation for "informativity" claims in RSA-style models: as an observation kernel becomes noisier, the listener's posterior moves closer to the prior. Specifically, for the @cite{goodman-stuhlmuller-2013} cancellation theorem, DPI says that as speaker access decreases (kernel becomes less informative), KL(L1(a, u) ‖ prior) decreases — the listener gains less information from the utterance.

theorem Real.klFun_logSum_le {ι : Type u_1} (s : Finset ι) (x y : ι → ℝ) (hx : ∀ i ∈ s, 0 ≤ x i) (hy : ∀ i ∈ s, 0 < y i) :
(∑ i ∈ s, y i) * InformationTheory.klFun ((∑ i ∈ s, x i) / ∑ i ∈ s, y i) ≤ ∑ i ∈ s, y i * InformationTheory.klFun (x i / y i)

Log-sum inequality (klFun version): for non-negative reals x_i and positive reals y_i, writing X = ∑ x_i and Y = ∑ y_i, we have Y * klFun(X / Y) ≤ ∑ y_i * klFun(x_i / y_i).

A direct corollary of Jensen's inequality applied to the convex function klFun on [0, ∞), with weights w_i = y_i / Y.
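
Spelled out, the Jensen step is the following (a sketch in the notation above, writing X = ∑ x_i, Y = ∑ y_i, and w_i = y_i / Y, so the weights w_i are non-negative and sum to 1):

  klFun (X / Y) = klFun (∑ i, w_i * (x_i / y_i))    -- since ∑ i, w_i * (x_i / y_i) = X / Y
                ≤ ∑ i, w_i * klFun (x_i / y_i)      -- convexity of klFun on [0, ∞)

Multiplying both sides by Y > 0 gives Y * klFun (X / Y) ≤ ∑ i, y_i * klFun (x_i / y_i), which is the statement above.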

This is the foundation for the data processing inequality on KL divergence: when we marginalize a finite-supported distribution, the KL of marginals is at most the KL of joints.
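
A sketch of that step (the standard argument, stated with real-valued sums; the positivity hypotheses of PMF.klDiv_bind_le below are what justify rewriting klDiv pointwise as ∑ a, q a * klFun (p a / q a) in the finite case). For each output b, apply the log-sum inequality with x a := p a * (κ a) b and y a := q a * (κ a) b, so that x a / y a = p a / q a, ∑ a, x a = (p.bind κ) b, and ∑ a, y a = (q.bind κ) b:

  (q.bind κ) b * klFun ((p.bind κ) b / (q.bind κ) b) ≤ ∑ a, q a * (κ a) b * klFun (p a / q a)

Summing over b and using ∑ b, (κ a) b = 1 on the right-hand side yields

  klDiv (p.bind κ) (q.bind κ) ≤ ∑ a, q a * klFun (p a / q a) = klDiv p q.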

theorem PMF.klDiv_bind_le {α : Type u_1} {β : Type u_2} [Fintype α] [Fintype β] [MeasurableSpace α] [MeasurableSpace β] [MeasurableSingletonClass α] [MeasurableSingletonClass β] (p q : PMF α) (κ : α → PMF β) (h_ac : p.toMeasure.AbsolutelyContinuous q.toMeasure) (h_bind_ac : (p.bind κ).toMeasure.AbsolutelyContinuous (q.bind κ).toMeasure) (h_q_pos : ∀ (a : α), q a ≠ 0) (h_κ_pos : ∀ (a : α) (b : β), q a ≠ 0 → (q.bind κ) b ≠ 0 → (κ a) b ≠ 0) :
(p.bind κ).klDiv (q.bind κ) ≤ p.klDiv q

Data Processing Inequality for PMF.klDiv (finite case): applying a Markov kernel κ : α → PMF β to two PMFs cannot increase their KL divergence: klDiv (p.bind κ) (q.bind κ) ≤ klDiv p q.

The information-theoretic content: post-processing observations through any noisy channel can only destroy information about the source.

Hypotheses: p ≪ q (otherwise both sides may be ∞ and the inequality is vacuous in a less interesting way) and q a ≠ 0 → κ a b ≠ 0 for all a, b where the inequality at b matters (encoded via the (q.bind κ) b ≠ 0 → ... flow). For finite-support discrete measures, both hypotheses reduce to checking at finitely many (a, b) pairs.
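
A minimal usage sketch (assuming this module builds and imports as named; the full-support premise on κ is only for illustration, not a hypothesis of the theorem): when q and every row κ a are nowhere zero, the two positivity side conditions discharge directly.

  import Linglib.Core.Probability.DataProcessing

  -- Hypothetical corollary: full support of q and of every κ a gives the positivity conditions for free.
  example {α β : Type} [Fintype α] [Fintype β]
      [MeasurableSpace α] [MeasurableSpace β]
      [MeasurableSingletonClass α] [MeasurableSingletonClass β]
      (p q : PMF α) (κ : α → PMF β)
      (h_ac : p.toMeasure.AbsolutelyContinuous q.toMeasure)
      (h_bind_ac : (p.bind κ).toMeasure.AbsolutelyContinuous (q.bind κ).toMeasure)
      (hq : ∀ a, q a ≠ 0) (hκ : ∀ a b, (κ a) b ≠ 0) :
      (p.bind κ).klDiv (q.bind κ) ≤ p.klDiv q :=
    PMF.klDiv_bind_le p q κ h_ac h_bind_ac hq (fun a b _ _ => hκ a b)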

This is the core principle that drives:

  • the cancellation theorem in @cite{goodman-stuhlmuller-2013} (less informative speaker access → smaller KL of L1 from prior),
  • post-processing of any kernel-decomposed RSA model.