Dirichlet–multinomial distribution #
@cite{odonnell-2015}
The count-vector PMF associated with a PolyaUrn. Sequentially: the
distribution of (x_1, …, x_K) after drawing N times from the
urn, where x_i counts how often color i was drawn. By
Dirichlet–Categorical conjugacy, equivalently obtained by sampling
θ ~ Dirichlet(π) and then (x_1, …, x_K) ~ Multinomial(N; θ) and
integrating out θ.
Closed-form mass at x : α → ℕ with ∑ i, x i = N:
P(x | π) = (N! / ∏ x_i!) · Γ(Σ π) / Γ(Σ π + N) · ∏ Γ(π_i + x_i) / Γ(π_i)
The first factor is the multinomial coefficient (number of draw
sequences realizing the count vector x); the rest is
PolyaUrn.seqProb x (the per-sequence likelihood, defined in the
sibling file PolyaUrn.lean).
Two-tier mathlib pattern #
Mirrors Mathlib.Probability.Distributions.Poisson.Basic's
poissonPMFReal + poissonPMF: a raw-ℝ closed form pmfReal and
a proper PMF (α → ℕ) wrapper pmf constructed directly from a
HasSum proof.
Status #
Normalization (pmfReal_hasSum_one) is fully discharged. The two key
combinatorial lemmas are proved by induction on N via Fin.snoc
decomposition:
multinomCount_mul_prod_factorial: the multinomial counting identity#{seq | count seq = x} · ∏ x_i! = N!. StepN+1partitions sequences by their last element viamultinomCount_snoc_fiber.exists_seq_count_eq: every count vector withΣ = Nis realized by some sequence, by IH-induction (decrement at somecwithx c > 0, then snoc).
Candidate to upstream to mathlib alongside a proper
Mathlib/Probability/Distributions/Dirichlet.lean /
DirichletMultinomial.lean.
Why split from PolyaUrn.lean #
Downstream consumers (Theories/Morphology/FragmentGrammars/DMPCFG,
AdaptorGrammar, FragmentGrammar) consume only
PolyaUrn.seqProb — a corpus IS a labeled derivation sequence, not
a draw from the unlabeled-count distribution. Bundling the PMF
machinery into PolyaUrn.lean would force every consumer of
seqProb to pay the Probability.ProbabilityMassFunction.Basic
import cost (~10s of olean loading via MeasureTheory.Measure.Dirac).
Mathlib analog: Distributions/Poisson/Basic.lean is distinct from
Distributions/Poisson/Variance.lean for the same import-discipline
reason.
Why we avoid PMF.ofFinset + Finset.piAntidiag + Nat.multinomial #
Each lives in a heavier mathlib file
(ProbabilityMassFunction.Constructions, Algebra.Order.Antidiag.Pi,
Data.Nat.Choose.Multinomial) and we don't need them: the PMF can
be built directly from a HasSum (mathlib's PMF.ofFinset is
itself just ⟨_, hasSum_sum_of_ne_finset_zero ...⟩), the support
characterization fits as an inline if ∑ i, x i = N then _ else 0,
and the multinomial coefficient is just (∑ i, x i)! / ∏ i, (x i)!
written out. Mathlib's Poisson takes the same approach
(⟨ENNReal.ofReal ∘ poissonPMFReal r, _⟩).
The labeled→unlabeled .map to PMF (Nat.Partition N) is deferred
to a future file (would import Combinatorics.Enumerative.Partition
and ProbabilityMassFunction.Monad for .map).
Main definitions #
DirichletMultinomial.pmfReal— closed-form mass at a count vector, as a rawℝ(= multinomial coefficient ·seqProb).DirichletMultinomial.pmf— the distribution as a properPMF (α → ℕ), supported on count vectors summing toN.pmfReal_hasSum_onedischarges the HasSum chain via the multinomial counting identity + count-vector surjectivity (both proved in this file).
Dirichlet–multinomial closed-form mass at a count vector, as a
raw ℝ. The multinomial coefficient (∑ x_i)! / ∏ (x_i!) counts
how many draw sequences realize the count vector x; multiplying
by PolyaUrn.seqProb x gives the count-vector mass. For x
summing to N, this equals
(N! / ∏ x_i!) · Γ(Σ π) / Γ(Σ π + N) · ∏ Γ(π_i + x_i) / Γ(π_i) —
the standard Dirichlet–multinomial mass.
Equations
- ProbabilityTheory.DirichletMultinomial.pmfReal u x = (↑(∑ i : α, x i).factorial / ∏ i : α, ↑(x i).factorial) * u.seqProb x
Instances For
Dirichlet–multinomial normalization. The closed-form mass sums
to 1 over count vectors of any fixed total N (zero elsewhere).
Proof structure (sequence-based, not Dirichlet-integration-based):
PolyaUrn.sum_seqProb_eq_one: per-sequence likelihood sums to 1 overFin N → α(already proved).Finset.sum_fiberwise_of_maps_to: regroup the sequence sum by count vector; the inner sum over each fiber gives(card of fiber) · seqProb x.multinomCount_mul_prod_factorial(sorried):card of fiber = N!/∏x_i!, matchingpmfReal x = (N!/∏x_i!) · seqProb x.exists_seq_count_eq(sorried): every x withΣ = Nlies in the count image, so the function vanishes outside this finite set.hasSum_sum_of_ne_finset_zero+ENNReal.ofReal_sum_of_nonneg: lift the real-valued finite-sum result to the desiredHasSumover ENNReal.
Both sorries are routine combinatorial constructions on Fin N → α
(induction on N via Fin.snocEquiv, list-based enumeration of α).
The Dirichlet–multinomial distribution as a proper PMF (α → ℕ),
supported on count vectors summing to N.
Constructed directly from HasSum (mirroring mathlib's Poisson
⟨ENNReal.ofReal ∘ poissonPMFReal r, _⟩) rather than via heavier
PMF.ofFinset / Finset.piAntidiag machinery — see file docstring.
Equations
- ProbabilityTheory.DirichletMultinomial.pmf u N = ⟨fun (x : α → ℕ) => if ∑ i : α, x i = N then ENNReal.ofReal (ProbabilityTheory.DirichletMultinomial.pmfReal u x) else 0, ⋯⟩
Instances For
Outside its support (count vectors that don't sum to N), the
Dirichlet–multinomial PMF is zero.