Multinomial probabilistic context-free grammar #
@cite{odonnell-2015}
The baseline stochastic CFG: for each nonterminal A, a multinomial
distribution over A's expansions; the probability of a derivation
tree factorizes as the product of the rule weights it uses, and the
probability of a corpus factorizes across derivations.
This construction predates @cite{odonnell-2015} by decades — see
Booth 1969, Booth & Thompson 1973, and Chi & Geman 1998 for the
canonical literature on PCFGs and per-LHS multinomial structure (these
are not yet in references.bib; cite-key additions deferred).
@cite{odonnell-2015} §3.1.2 gives the didactic treatment we follow
for notation and as the substrate for the §3.1.4–§3.1.8 family of
priors over multinomial PCFGs (DMPCFG, MAG, FG) that build on this
baseline.
Factorization (@cite{odonnell-2015} eq 3.2 + eq 3.5) #
Per @cite{odonnell-2015} eq 3.2, the per-derivation probability is the product of the rule weights it uses:
P(d | G) = ∏_{r ∈ d} θ_r (eq 3.2)
Per @cite{odonnell-2015} eq 3.5, the corpus probability collects these into a single product over rules:
P(D | G) = ∏_{A ∈ V_NT} ∏_{r ∈ R^A} (θ_r^A)^{x_r^A} (eq 3.5)
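The two forms are connected by collecting identical factors: grouping the eq 3.2 product, taken over every derivation in the corpus, by which rule fires — with x_r^A the number of times rule r with LHS A is used across D — yields eq 3.5:

```latex
P(D \mid G)
  = \prod_{d \in D} P(d \mid G)
  = \prod_{d \in D} \prod_{r \in d} \theta_r
  = \prod_{A \in V_{NT}} \prod_{r \in R^A} \left(\theta^A_r\right)^{x^A_r}
```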
derivProb formalizes eq 3.2; corpusProb is its natural lift to a
multiset of derivations. The count-form (eq 3.5) is provably
equivalent to the per-derivation form (corpusProb_eq_prod_pow_count,
deferred — needs derivRuleCount extracted from DMPCFG to a
shared substrate first).
The factorization across derivations (corpusProb_add) is what
distinguishes the multinomial-PCFG baseline from its richer-prior
descendants DMPCFG, MAG, FG, where exchangeable Pólya / PYP /
beta-binomial state couples derivations through shared corpus-aggregate
counts. DOP estimators (DOP1, ENDOP) also factorize via Goodman
1998/2003's PCFG reductions, despite being on the
"non-baseline" side of the substrate; see Comparisons.lean.
Architecture #
MultinomialPCFG G is a single point in weight space: for each LHS,
a PMF over its rule bucket. Mathlib's PMF discipline is genuine
here, not aspirational — normalization is part of what a probability
mass function is, so the previous ruleProb_nonneg /
ruleProb_normalized side conditions disappear and noncomputable is
forced only by PMF's ℝ≥0∞ carrier (not by our use of ℝ).
The forgetful projection to WeightedCFG G ℝ≥0∞
(Core/Computability/WeightedCFG.lean) is toWeightedCFG. The bridge
from richer-prior descendants is the function
DMPCFG.posteriorMAP : DMPCFG G → Multiset _ → MultinomialPCFG G
(DMPCFG.lean): collapse a Dirichlet prior, conditioned on a corpus,
into its MAP point estimate. DMPCFG does not extend
MultinomialPCFG; the two are conceptually distinct objects — a prior
over weight-points versus a single weight-point.
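In Lean signatures, the shapes described above look roughly as follows (a sketch only: the real declarations appear later in this file, the `weight_nonneg` discharge and the `sorry` bodies are illustrative, and the `Multiset _` argument is left elided as in the text):

```lean
-- Sketch of the architecture; illustrative signatures, not the real declarations.
structure MultinomialPCFG (G : CFG) [∀ a : G.NT, Nonempty (G.RulesWithLHS a)] where
  rulePMF : (a : G.NT) → PMF (G.RulesWithLHS a)  -- one PMF per LHS bucket

-- Forgetful projection into the unbundled weighted-CFG substrate;
-- nonnegativity is automatic from the ℝ≥0∞ carrier.
noncomputable def MultinomialPCFG.toWeightedCFG
    (W : MultinomialPCFG G) : WeightedCFG G ℝ≥0∞ :=
  { weight := W.ruleProb, weight_nonneg := fun _ => zero_le _ }

-- Bridge from the richer prior: collapse a corpus-conditioned Dirichlet
-- prior into its MAP point estimate (a single weight-point).
noncomputable def DMPCFG.posteriorMAP :
    DMPCFG G → Multiset _ → MultinomialPCFG G := sorry
```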
The structure requires [∀ a : G.NT, Nonempty (G.RulesWithLHS a)]:
PMFs over empty supports don't exist, so grammars with "useless"
nonterminals (no expansion) cannot carry a MultinomialPCFG. This
constraint is now structural rather than implicit (the previous
ruleProb_normalized field demanded sum = 1 for every a, which was
unsatisfiable when the LHS bucket was empty).
Main definitions #
- MultinomialPCFG G — per-LHS PMF over the LHS rule bucket.
- MultinomialPCFG.ruleProb — per-rule probability (PMF mass on the rule's LHS bucket when r ∈ G.rules, else 0).
- MultinomialPCFG.toWeightedCFG — forgetful projection to WeightedCFG G ℝ≥0∞.
- MultinomialPCFG.derivProb — per-derivation probability, recursive product of rule weights through the tree (eq 3.2).
- MultinomialPCFG.corpusProb — corpus probability as the product of per-derivation probabilities.
Main theorems #
- MultinomialPCFG.corpusProb_add — multiplicativity over disjoint corpora (the Lean content of the "derivations are independent" claim).
- MultinomialPCFG.corpusProb_zero — empty corpus has probability 1.
References #
- @cite{odonnell-2015} §3.1.2 (eq 3.2 + eq 3.5).
A multinomial PCFG over G: for each nonterminal a, a PMF over
the rules with LHS a. Per @cite{odonnell-2015} §3.1.2.
The structure carries normalization as a structural property via
mathlib's PMF (the partition function is bundled into what a PMF
is), eliminating the hand-rolled ruleProb_nonneg / ruleProb_normalized
side conditions that the previous shape carried.
Requires [∀ a, Nonempty (G.RulesWithLHS a)] because PMF.ofFintype
fails on empty supports — grammars with "useless nonterminals" (no
rules) cannot carry a MultinomialPCFG.
- rulePMF (a : G.NT) : PMF (G.RulesWithLHS a)
Per-LHS distribution over the rules sharing that LHS.
Per-rule probability under a multinomial PCFG: the PMF mass at the
rule's LHS bucket if r ∈ G.rules, else 0. The case split is on
membership in the bucket G.rules.filter (·.input = r.input), which
is equivalent to r ∈ G.rules (the LHS-equality side is trivially
true).
Returns ℝ≥0∞ rather than ℝ so that nonnegativity is structural
and the value composes with mathlib's PMF API directly. The ℝ
view is ENNReal.toReal ∘ ruleProb.
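A sketch of that case split (the `dite` carries the bucket-membership proof into the subtype; `G.Rule` and the `ruleProbReal` helper are assumed spellings, not declarations from this file):

```lean
-- Sketch: dite on bucket membership; the branch proof feeds the subtype.
noncomputable def ruleProb (W : MultinomialPCFG G) (r : G.Rule) : ℝ≥0∞ :=
  if h : r ∈ G.rules.filter (·.input = r.input) then
    W.rulePMF r.input ⟨r, h⟩  -- PMF mass at r within its LHS bucket
  else 0                      -- rules outside G.rules get weight 0

-- The ℝ view described above (hypothetical helper name):
noncomputable def ruleProbReal (W : MultinomialPCFG G) (r : G.Rule) : ℝ :=
  (W.ruleProb r).toReal
```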
The per-rule probability is 0 for rules outside the grammar.
The MultinomialPCFG only assigns mass to rules of G; anything
outside is implicitly weight 0.
The per-rule probability matches the PMF mass when r ∈ G.rules.
The bucket-membership proof is reconstructed from r ∈ G.rules
plus the trivial r.input = r.input.
Forgetful projection to the unbundled weighted-CFG substrate
(Core/Computability/WeightedCFG.lean). The weights live in ℝ≥0∞,
with nonnegativity automatic by typing.
This is the projection promised by the previous-version docstring
("the unbundled 'weighted CFG' is genuinely useful for theories
where weights are not yet normalized; will be introduced when the
first such consumer arrives") — DMPCFG is that consumer; this is
the projection from MultinomialPCFG into the new shared substrate.
Equations
- W.toWeightedCFG = { weight := W.ruleProb, weight_nonneg := ⋯ }
Per-derivation probability under a multinomial PCFG
(@cite{odonnell-2015} eq 3.2). Recurses on the tree structure:
each internal node contributes the weight of the rule it
instantiates, leaves contribute 1. Invalid rules (those not in
G.rules) contribute 0 via ruleProb's default.
Equations
- W.derivProb (CFGTree.leaf t) = 1
- W.derivProb (CFGTree.node nt cs) = W.ruleProb { input := nt, output := List.map CFGTree.rootSymbol cs } * W.derivProbList cs
Product of derivation probabilities over a list of subtrees.
Equations
- W.derivProbList [] = 1
- W.derivProbList (t :: ts) = W.derivProb t * W.derivProbList ts
Corpus probability of a multiset of derivations: the product of their
individual derivProb values. By construction this factorizes
multiplicatively over disjoint corpora — see corpusProb_add. The
count-form (@cite{odonnell-2015} eq 3.5) is mathematically equivalent
but deferred (corpusProb_eq_prod_pow_count) until derivRuleCount
is extracted from DMPCFG to a shared substrate file.
Equations
- W.corpusProb D = (Multiset.map W.derivProb D).prod
The empty corpus has probability 1 — the empty product.
Multiplicativity over disjoint corpora: the corpus probability of a
union of two sub-corpora is the product of their corpus
probabilities. This is the Lean content of the "derivations are
independent" claim that distinguishes multinomial PCFGs from their
richer-prior descendants (DMPCFG, MAG, FG) — see
Comparisons.lean for the formal contrast.
Trivially true here by Multiset.prod_add; the content is that
the analogous theorem for DMPCFG.corpusProb fails (because the
Pólya factor couples derivations through shared rule counts).
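Given the `corpusProb` equation above, the proof is the promised one-liner (a sketch; the `CFGTree G` spelling is an assumption, and the cited lemma names are from mathlib's `Multiset` API):

```lean
theorem corpusProb_add (W : MultinomialPCFG G) (D₁ D₂ : Multiset (CFGTree G)) :
    W.corpusProb (D₁ + D₂) = W.corpusProb D₁ * W.corpusProb D₂ := by
  -- map distributes over +, then the product of a sum of multisets splits.
  simp [corpusProb, Multiset.map_add, Multiset.prod_add]
```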
Concrete instances #
The uniform multinomial PCFG: each LHS bucket gets the
uniform distribution over its rules. The canonical baseline
instance — the one a maximum-entropy estimator returns at zero
data, useful as a reference point and as the Inhabited witness.
Maximum-entropy property: for each LHS, this distribution maximizes
PMF.entropy over the rule bucket.
Equations
- Morphology.FragmentGrammars.MultinomialPCFG.uniform = { rulePMF := fun (a : G.NT) => PMF.uniformOfFintype (G.RulesWithLHS a) }
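The maximum-entropy property claimed above can be stated as a comparison against an arbitrary weight-point at the same LHS (a hypothetical spelling — the theorem name and its proof are not part of this file):

```lean
-- Hypothetical statement of the maximum-entropy property; proof deferred.
theorem uniform_lhsEntropy_maximal (W : MultinomialPCFG G) (a : G.NT) :
    W.lhsEntropy a ≤ (MultinomialPCFG.uniform : MultinomialPCFG G).lhsEntropy a := by
  sorry
```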
Information-theoretic primitives (bridge to Entropy) #
The per-LHS PMFs let us inherit mathlib's PMF entropy / KL machinery
directly. These primitives are the integration bridge to processing-cost
theories (Theories/Processing/MemorySurprisal/): rule-weight entropy
gives the local uncertainty at each nonterminal expansion choice, and
KL between two MultinomialPCFGs measures how different their
rule-weight predictions are.
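A KL primitive is not defined in this section; a hypothetical per-LHS shape, assuming some PMF-level divergence `PMF.klDiv` were available (an assumed name, not confirmed mathlib API), would be:

```lean
-- Hypothetical: `PMF.klDiv` is an assumed primitive, not confirmed API.
noncomputable def lhsKL (W₁ W₂ : MultinomialPCFG G) (a : G.NT) : ℝ≥0∞ :=
  PMF.klDiv (W₁.rulePMF a) (W₂.rulePMF a)
```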
Per-LHS entropy: entropy of the PMF over the rules with the given
LHS. The local "uncertainty" of which expansion will be chosen for
nonterminal a. Inherited via PMF.entropy.
Equations
- W.lhsEntropy a = (W.rulePMF a).entropy
Entropy is nonnegative — a direct corollary of PMF.entropy_nonneg.
Count-form factorization (@cite{odonnell-2015} eq 3.5) #
The count-form factorization (@cite{odonnell-2015} eq 3.5).
For a corpus D of valid derivation trees, the corpus probability
collects rule contributions into a single product over the grammar's
rules raised to their corpus counts:
P(D | G) = ∏_{r ∈ G.rules} (W.ruleProb r) ^ (corpusRuleCount r D)
The validity precondition is essential: invalid trees use rules
outside G.rules, so the per-derivation product derivProb collapses
to 0 (via ruleProb's default), but the count-form product over
G.rules ignores those rules and would compute a nonzero value.
For valid trees the two coincide.
Status: stated, sorried. Proof requires (1) per-tree count-form
derivProb t = ∏_{r ∈ G.rules} ruleProb r ^ ruleCount r t for valid
t, by mutual induction on CFGTree/derivProb's mutual structure;
then (2) Multiset induction on D using corpusProb_add +
corpusRuleCount_add + Finset.prod_pow_add. The mutual induction
in (1) is the hard part — Lean's auto-generated derivProb.induct
needs careful handling. Architecture is in place; mechanical proof
deferred to a focused session.
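The stated (sorried) shape, per the plan above, is roughly the following (a sketch: `Valid`, `corpusRuleCount`, the `CFGTree G` spelling, and the Finset product over `G.rules` are assumed from the surrounding text, not confirmed declarations):

```lean
theorem corpusProb_eq_prod_pow_count (W : MultinomialPCFG G)
    (D : Multiset (CFGTree G)) (hD : ∀ t ∈ D, t.Valid) :
    W.corpusProb D =
      ∏ r ∈ G.rules, W.ruleProb r ^ corpusRuleCount r D := by
  sorry  -- (1) per-tree count-form by mutual induction; (2) Multiset induction.
```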