Data processing inequality for `PMF.klDiv` (finite-case) #

The data processing inequality (DPI) states that applying a (Markov) kernel to two distributions can only decrease their KL divergence:

klDiv (p.bind κ) (q.bind κ) ≤ klDiv p q

This file proves DPI for PMFs over Fintypes, plus the underlying log-sum inequality from which DPI follows.

Main results #

Real.klFun_logSum_le: log-sum inequality — Y * klFun(X / Y) ≤ ∑ b_i * klFun(a_i / b_i) for non-negative reals.
PMF.klDiv_bind_le: DPI for bind on finite types.
PMF.klDiv_map_le: DPI for map (deterministic kernel) on finite types.

Application #

DPI is the structural foundation for "informativity" claims in RSA-style models: as an observation kernel becomes noisier, the listener's posterior moves closer to the prior. Specifically, for the @cite{goodman-stuhlmuller-2013} cancellation theorem, DPI says that as speaker access decreases (kernel becomes less informative), KL(L1(a, u) ‖ prior) decreases — the listener gains less information from the utterance.

source

theorem Real.klFun_logSum_le {ι : Type u_1} (s : Finset ι) (x y : ι → ℝ) (hx : ∀ i ∈ s, 0 ≤ x i) (hy : ∀ i ∈ s, 0 < y i) :

(∑ i ∈ s, y i) * InformationTheory.klFun ((∑ i ∈ s, x i) / ∑ i ∈ s, y i) ≤ ∑ i ∈ s, y i * InformationTheory.klFun (x i / y i)

Log-sum inequality (klFun version): for non-negative reals x_i, y_i with y_i > 0 summing to Y and x_i summing to X, Y * klFun(X / Y) ≤ ∑ y_i * klFun(x_i / y_i).

A direct corollary of Jensen's inequality applied to the convex function klFun on [0, ∞), with weights w_i = y_i / Y.

This is the foundation for the data processing inequality on KL divergence: when we marginalize a finite-supported distribution, the KL of marginals is at most the KL of joints.

source

theorem PMF.klDiv_bind_le {α : Type u_1} {β : Type u_2} [Fintype α] [Fintype β] [MeasurableSpace α] [MeasurableSpace β] [MeasurableSingletonClass α] [MeasurableSingletonClass β] (p q : PMF α) (κ : α → PMF β) (h_ac : p.toMeasure.AbsolutelyContinuous q.toMeasure) (h_bind_ac : (p.bind κ).toMeasure.AbsolutelyContinuous (q.bind κ).toMeasure) (h_q_pos : ∀ (a : α), q a ≠ 0) (h_κ_pos : ∀ (a : α) (b : β), q a ≠ 0 → (q.bind κ) b ≠ 0 → (κ a) b ≠ 0) :

(p.bind κ).klDiv (q.bind κ) ≤ p.klDiv q

Data Processing Inequality for PMF.klDiv (finite case): applying a Markov kernel κ : α → PMF β to two PMFs cannot increase their KL divergence: klDiv (p.bind κ) (q.bind κ) ≤ klDiv p q.

The information-theoretic content: post-processing observations through any noisy channel can only destroy information about the source.

Hypotheses: p ≪ q (otherwise both sides may be ∞ and the inequality is vacuous in a less interesting way) and q a ≠ 0 → κ a b ≠ 0 for all a, b where the inequality at b matters (encoded via the (q.bind κ) b ≠ 0 → ... flow). For finite-support discrete measures, both hypotheses reduce to checking at finitely many (a, b) pairs.

This is the core principle that drives:

the cancellation theorem in @cite{goodman-stuhlmuller-2013} (less informative speaker access → smaller KL of L1 from prior),
post-processing of any kernel-decomposed RSA model.

Documentation

Linglib.Core.Probability.DataProcessing

Data processing inequality for `PMF.klDiv` (finite-case) #

Main results #

Application #

Data processing inequality for PMF.klDiv (finite-case) #

Main results #

Application #

Data processing inequality for `PMF.klDiv` (finite-case) #