Documentation

Linglib.Core.Probability.DataProcessing

Data processing inequality for PMF.klDiv (finite-case) #

The data processing inequality (DPI) states that applying a (Markov) kernel to two distributions can only decrease their KL divergence:

klDiv (p.bind κ) (q.bind κ) ≤ klDiv p q

This file proves DPI for PMFs over Fintypes, plus the underlying log-sum inequality from which DPI follows.

Main results #

Application #

DPI is the structural foundation for "informativity" claims in RSA-style models: as an observation kernel becomes noisier, the listener's posterior moves closer to the prior. Specifically, for the [GS13] cancellation theorem, DPI says that as speaker access decreases (kernel becomes less informative), KL(L1(a, u) ‖ prior) decreases — the listener gains less information from the utterance.

theorem Real.klFun_logSum_le {ι : Type u_1} (s : Finset ι) (x y : ι) (hx : is, 0 x i) (hy : is, 0 < y i) :
(∑ is, y i) * InformationTheory.klFun ((∑ is, x i) / is, y i) is, y i * InformationTheory.klFun (x i / y i)

Log-sum inequality (klFun version): for non-negative reals x_i, y_i with y_i > 0 summing to Y and x_i summing to X, Y * klFun(X / Y) ≤ ∑ y_i * klFun(x_i / y_i).

A direct corollary of Jensen's inequality applied to the convex function klFun on [0, ∞), with weights w_i = y_i / Y.

This is the foundation for the data processing inequality on KL divergence: when we marginalize a finite-supported distribution, the KL of marginals is at most the KL of joints.

theorem PMF.klDiv_bind_le {α : Type u_1} {β : Type u_2} [Fintype α] [Fintype β] [MeasurableSpace α] [MeasurableSpace β] [MeasurableSingletonClass α] [MeasurableSingletonClass β] (p q : PMF α) (κ : αPMF β) (h_ac : p.toMeasure.AbsolutelyContinuous q.toMeasure) :
(p.bind κ).klDiv (q.bind κ) p.klDiv q

Data Processing Inequality for PMF.klDiv (finite case): applying a Markov kernel κ : α → PMF β to two PMFs cannot increase their KL divergence: klDiv (p.bind κ) (q.bind κ) ≤ klDiv p q.

The information-theoretic content: post-processing observations through any noisy channel can only destroy information about the source. The only hypothesis is absolute continuity p ≪ q; absolute continuity of the binds is derived.

Data processing for mutual information #

Garbling one coordinate of a joint distribution through a channel cannot increase mutual information. This is the form of the DPI that constrains memory models: a representation generated from a context can carry no more information about another variable than the context itself ([FGL20] §3.2).

theorem PMF.toMeasure_absolutelyContinuous_product {α : Type u_1} {β : Type u_2} [Fintype α] [Fintype β] [MeasurableSpace α] [MeasurableSpace β] [DecidableEq α] [DecidableEq β] [MeasurableSpace (α × β)] [MeasurableSingletonClass (α × β)] (joint : PMF (α × β)) :
joint.toMeasure.AbsolutelyContinuous (joint.fst.product joint.snd).toMeasure

A joint distribution is absolutely continuous with respect to the product of its marginals.

theorem PMF.fst_bind_snd {α : Type u_1} {β : Type u_2} {γ : Type u_3} (joint : PMF (α × β)) (κ : βPMF γ) :
(joint.bind fun (x : α × β) => map (Prod.mk x.1) (κ x.2)).fst = joint.fst

Garbling the second coordinate preserves the first marginal.

theorem PMF.snd_bind_snd {α : Type u_1} {β : Type u_2} {γ : Type u_3} (joint : PMF (α × β)) (κ : βPMF γ) :
(joint.bind fun (x : α × β) => map (Prod.mk x.1) (κ x.2)).snd = joint.snd.bind κ

Garbling the second coordinate pushes the second marginal through the channel.

theorem PMF.product_bind {α : Type u_1} {β : Type u_2} {γ : Type u_3} (P : PMF α) (Q : PMF β) (κ : βPMF γ) :
((P.product Q).bind fun (x : α × β) => map (Prod.mk x.1) (κ x.2)) = P.product (Q.bind κ)

Garbling commutes with forming the product of marginals.

theorem PMF.mutualInformation_bind_snd_le {α : Type u_1} {β : Type u_2} [Fintype α] [Fintype β] {γ : Type u_3} [Fintype γ] [MeasurableSpace α] [MeasurableSpace β] [MeasurableSpace γ] [MeasurableSingletonClass α] [MeasurableSingletonClass β] [MeasurableSingletonClass γ] [DecidableEq α] [DecidableEq β] (joint : PMF (α × β)) (κ : βPMF γ) :
(joint.bind fun (x : α × β) => map (Prod.mk x.1) (κ x.2)).mutualInformation joint.mutualInformation

Data processing inequality, mutual-information form: passing one coordinate of a joint distribution through a channel cannot increase mutual information.