Pólya urn (per-sequence likelihood) #

@cite{odonnell-2015}

A Pólya urn over an alphabet α (with K = |α| symbols) is a sequential sampling scheme governed by strictly positive pseudo-counts π : α → ℝ. The first draw is categorical with weights π_i / Σ π; the (N + 1)-th draw, conditional on the counts x_1, …, x_K of the previous N draws (so Σ x = N), is categorical with weights (π_i + x_i) / (Σ π + Σ x). This is a preferential-attachment dynamic: the finite-K variant of the power-law-tail dynamic that the Pitman–Yor process exhibits in the unbounded-alphabet limit.
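The draw rule above can be sketched numerically. The following Python is an illustration only and is not part of the formalization; the function name `sample_polya_urn`, the dict-based alphabet, and the `seed` parameter are assumptions made for this sketch.

```python
import random


def sample_polya_urn(pseudo, n_draws, seed=0):
    """Draw n_draws labels sequentially from a Polya urn.

    pseudo: dict mapping each label to a strictly positive pseudo-count
            (hypothetical representation of pi : alpha -> R).
    Returns the drawn sequence and the final count per label.
    """
    if any(p <= 0 for p in pseudo.values()):
        raise ValueError("pseudo-counts must be strictly positive")
    rng = random.Random(seed)
    labels = list(pseudo)
    counts = {label: 0 for label in labels}
    seq = []
    for _ in range(n_draws):
        # Unnormalized weight of label i is pi_i + x_i;
        # random.choices normalizes by (sum pi + sum x) internally.
        weights = [pseudo[label] + counts[label] for label in labels]
        drawn = rng.choices(labels, weights=weights, k=1)[0]
        counts[drawn] += 1
        seq.append(drawn)
    return seq, counts
```

Each draw increments the count of the drawn label, so labels that have already been drawn become more likely on later draws.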

By Dirichlet–Categorical conjugacy, drawing θ ~ Dirichlet(π), sampling i.i.d. from Categorical(θ), and then integrating out θ yields the same exchangeable sequence law. (The de Finetti representation theorem guarantees only that some mixing measure exists; identifying it as Dirichlet follows from conjugacy plus integration.) The probability of any specific draw sequence with counts x_1, …, x_K has the closed form of @cite{odonnell-2015} eq 3.7 (-- UNVERIFIED: section/equation number from memory; verify against PDF):

P(seq | π) = [Γ(Σ π) / Γ(Σ π + Σ x)]  ·  ∏_i [Γ(π_i + x_i) / Γ(π_i)]
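As a numerical sanity check on the closed form above, the sketch below compares it with the product of the sequential predictive weights (π_i + x_i) / (Σ π + Σ x). It is written in Python for illustration; the names `seq_prob_closed_form` and `seq_prob_chain_rule` are hypothetical and do not correspond to the Lean definitions.

```python
import math
from collections import Counter


def seq_prob_closed_form(pseudo, counts):
    """Closed-form probability of one specific sequence with the given counts."""
    total = sum(pseudo.values())
    n = sum(counts.get(label, 0) for label in pseudo)
    # Work in log space: Gamma grows quickly and would overflow for large counts.
    log_p = math.lgamma(total) - math.lgamma(total + n)
    for label, p in pseudo.items():
        log_p += math.lgamma(p + counts.get(label, 0)) - math.lgamma(p)
    return math.exp(log_p)


def seq_prob_chain_rule(pseudo, seq):
    """Product of the per-draw predictive probabilities along the sequence."""
    total = sum(pseudo.values())
    counts = Counter()
    prob = 1.0
    for n, label in enumerate(seq):
        # n draws have been made so far, so the normalizer is sum(pi) + n.
        prob *= (pseudo[label] + counts[label]) / (total + n)
        counts[label] += 1
    return prob
```

Because the closed form depends on the sequence only through its counts, any reordering of the same draws (for example `"abacab"` and `"aaabbc"`) receives the same probability, which is the exchangeability property in concrete form.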

This file gives only the closed-form per-sequence likelihood seqProb — the form fragment-grammar consumers in Theories/Morphology/FragmentGrammars/ actually use (a corpus IS a labeled derivation sequence, not a draw from the unlabeled-count distribution). The corresponding count-vector PMF — the "Dirichlet–multinomial distribution" — lives in the sibling file DirichletMultinomial.lean, which carries the heavier Probability.ProbabilityMassFunction.Basic import (transitively ~10s of olean loading via MeasureTheory.Measure.Dirac).

The sequential sampler itself is not formalized — only the closed form is needed by downstream constructions.

Type-polymorphic alphabet #

The alphabet α is an arbitrary type; operations require [Fintype α] (so that ∑ i, ... and ∏ i, ... are well-defined), and theorems requiring positivity of the total pseudo-count additionally need [Nonempty α]. The previous Fin K-indexed shape is the special case α = Fin K (with [NeZero K] equivalent to [Nonempty (Fin K)]); the polymorphic shape composes cleanly with Finset-restricted alphabets needed by per-LHS PCFG factors.

Relationship to `PitmanYor` #

The Pólya urn is often described as the "finite-K Chinese Restaurant Process". This is correct sequentially but misleading distributionally: the labeled count distribution DirichletMultinomial.pmf (sibling file) is not equal at any finite K to the partition distribution PitmanYor.partitionProb at discount = 0. The two agree only in the limit K → ∞ with symmetric pseudo-counts π_i = b/K (Blackwell & MacQueen 1973; Ferguson 1973). The bridge is therefore a limit theorem, not a finite equality, and is not yet formalized — the labeled→unlabeled .map pushforward to PMF (Nat.Partition N) (the natural target type for such a bridge) is also deferred.
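The limit statement can be illustrated numerically. The sketch below is Python, not the Lean development, and makes two assumptions: it works with set partitions of the draw positions (positions sharing a label form a block) rather than integer partitions, and it takes the discount-0 Chinese Restaurant Process probability to be the standard Ewens form b^m · Γ(b)/Γ(b + N) · ∏ (n_j − 1)!. The function names are hypothetical.

```python
import math


def urn_partition_prob(block_sizes, b, K):
    """Probability that N urn draws with symmetric pseudo-counts b/K
    induce one given set partition with the given block sizes."""
    m = len(block_sizes)
    n = sum(block_sizes)
    if m > K:
        return 0.0  # not enough labels to give each block its own
    # K! / (K - m)! ways to assign distinct labels to the m blocks.
    log_p = math.lgamma(K + 1) - math.lgamma(K - m + 1)
    # Per-sequence closed form; unused labels contribute a factor of 1.
    log_p += math.lgamma(b) - math.lgamma(b + n)
    for size in block_sizes:
        log_p += math.lgamma(b / K + size) - math.lgamma(b / K)
    return math.exp(log_p)


def crp_partition_prob(block_sizes, b):
    """Ewens (discount 0) probability of the same set partition."""
    m = len(block_sizes)
    n = sum(block_sizes)
    log_p = m * math.log(b) + math.lgamma(b) - math.lgamma(b + n)
    # lgamma(size) = log((size - 1)!)
    log_p += sum(math.lgamma(size) for size in block_sizes)
    return math.exp(log_p)


if __name__ == "__main__":
    sizes, b = [3, 2, 1], 1.5
    target = crp_partition_prob(sizes, b)
    for K in (10, 1000, 10**6):
        print(K, urn_partition_prob(sizes, b, K) / target)
```

Under these assumptions the ratio differs from 1 at every finite K and approaches 1 as K grows, consistent with the bridge being a limit theorem rather than a finite equality.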

Main definitions #

PolyaUrn α — pseudo-counts on α (the Dirichlet hyperparameters).
PolyaUrn.total — the sum Σ π_i.
PolyaUrn.seqProb — closed-form per-sequence likelihood (eq 3.7 of @cite{odonnell-2015}, depending only on counts).

References #

@cite{odonnell-2015} — Pólya-urn closed form for DMPCFG (-- UNVERIFIED: §3.1.3 eq 3.7 from memory; verify against PDF).
Blackwell, D. & MacQueen, J. B. (1973). "Ferguson distributions via Pólya urn schemes". The Annals of Statistics 1(2): 353–355.
Ferguson, T. S. (1973). "A Bayesian analysis of some nonparametric problems". The Annals of Statistics 1(2): 209–230.
