Dirichlet–multinomial PCFG (DMPCFG) #
@cite{odonnell-2015}
The "full-parsing" model of @cite{odonnell-2015} §2.4.1 / §3.1.4: a PCFG with a Dirichlet prior on the per-LHS rule-weight vector. Marginalizing out the rule weights yields the closed-form corpus probability of eq 3.9 — the Dirichlet–multinomial likelihood.
P(D | M) = ∏_{A ∈ V}
[ Γ(Σ_i π_i^A) / Γ(Σ_i π_i^A + Σ_i x_i^A)
· ∏_i Γ(π_i^A + x_i^A) / Γ(π_i^A) ]
where π_i^A are the Dirichlet hyperparameters (pseudo-counts) for
the i-th rule with LHS A, and x_i^A is the count of that rule
across the corpus D.
Why DMPCFG does not factorize #
The crucial difference from a multinomial PCFG (MultinomialPCFG)
is that a rule's occurrences across all derivations in the corpus
are pooled into a single Pólya term. Two derivations using rule r
are correlated through the shared count x_r in the closed form. As a result,
P(D | M) ≠ ∏_d P(d | M) — DMPCFG derivations are exchangeable but
not independent. The corpus probability is a corpus-level function;
there is no clean per-derivation factorization to expose.
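A worked instance of the non-factorization, as a hedged arithmetic sketch in plain Lean (not using Linglib's corpusProb; the grammar and counts are invented for illustration): one LHS with two rules, symmetric pseudo-counts π = (1, 1), and a corpus of two derivations each applying rule r₁ once.

```lean
-- One LHS, two rules, π = (1, 1); D = two derivations each using r₁ once.
-- Closed form on corpus counts x = (2, 0):
--   Γ(2)/Γ(4) · Γ(3)/Γ(1) · Γ(1)/Γ(1) = (1/6) · 2 = 1/3.
-- Per-derivation closed form on x = (1, 0):
--   Γ(2)/Γ(3) · Γ(2)/Γ(1) = 1/2.
-- The corpus probability is not the independent product: 1/3 ≠ 1/2 · 1/2.
example : (1 / 6 : ℚ) * 2 ≠ (1 / 2) * (1 / 2) := by norm_num
```

Reusing rule r₁ is rewarded: the corpus probability 1/3 exceeds the independence prediction 1/4, which is the rich-get-richer behavior of the Pólya urn.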
Connection to PolyaUrn #
The per-LHS factor lhsFactor M D A IS the closed-form Pólya-urn
per-sequence likelihood over rules with LHS A, using M.pseudo
restricted to those rules. The Type-polymorphic PolyaUrn α from
Linglib.Core.Probability.PolyaUrn is constructed per-LHS via
lhsUrn; positivity of corpusProb is derived structurally from
PolyaUrn.seqProb_pos. (We use seqProb, not the count-vector PMF
polyaUrnPMFReal, because a corpus IS a specific labeled sequence
of derivations, not a draw from the unlabeled-count distribution.)
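As a hedged sketch of what the per-LHS factor computes (the name seqProbSketch is invented here; this is not Linglib's actual PolyaUrn.seqProb), the closed form can be written directly with Mathlib's Real.Gamma:

```lean
import Mathlib.Analysis.SpecialFunctions.Gamma.Basic

open Real

/-- Sketch of the per-sequence Pólya likelihood for counts `x`
under pseudo-counts `π` (cf. the per-LHS factor of eq 3.9). -/
noncomputable def seqProbSketch {n : ℕ} (π : Fin n → ℝ) (x : Fin n → ℕ) : ℝ :=
  Gamma (∑ i, π i) / Gamma (∑ i, π i + ∑ i, (x i : ℝ)) *
    ∏ i, Gamma (π i + (x i : ℝ)) / Gamma (π i)
```

In the library the index type is the subtype of rules with a fixed LHS rather than Fin n, but the Gamma-ratio shape is the same.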
Main definitions #
- DMPCFG G — per-rule Dirichlet pseudo-counts.
- Rule application counts (CFGTree.ruleCount, CFGTree.corpusRuleCount) are CFGTree-level operations in Core/Computability/CFGTree.lean; opened into this namespace.
- DMPCFG.lhsUrn — per-LHS PolyaUrn over the subtype of rules with that LHS.
- DMPCFG.lhsCounts — per-LHS corpus counts as a function on the subtype.
- DMPCFG.lhsFactor — per-LHS Pólya per-sequence likelihood (delegates to PolyaUrn.seqProb).
- DMPCFG.corpusProb — eq 3.9 closed-form corpus probability.
References #
- @cite{odonnell-2015} §2.4.1, §3.1.4 (eq 3.7–3.9).
A Dirichlet–multinomial PCFG over G: per-rule pseudo-counts that
parametrize a Dirichlet prior on the per-LHS rule-weight vector.
Per @cite{odonnell-2015} §2.4.1 the conditional distribution over
weights given a corpus is again Dirichlet (conjugacy). The "MAP
weights" (CL convention) are taken to be the Dirichlet posterior
mean (π + x) / Σ(π + x), proportional to corpus rule frequencies.
Terminology note: in strict Bayesian usage MAP = mode of posterior
over θ, with formula (π - 1) / Σ(π - 1) requiring π > 1 everywhere.
For the symmetric Dirichlet priors typical in CL (α ≤ 1) the strict
mode is at the simplex boundary, so practitioners use the posterior
mean (= posterior predictive probability) and call it "MAP" loosely.
@cite{odonnell-2015} follows this convention; we do too. The strict
mode is also formalized as posteriorMode; see that section for the
ordering-equivalence theorem (mapWeight_lt_iff_posteriorMode_lt)
that shows the loose convention is harmless for ordering claims.
- pseudo : ContextFreeRule T G.NT → ℝ
Per-rule pseudo-count (Dirichlet hyperparameter).
Pseudo-counts are strictly positive on grammar rules.
Instances For
Per-LHS Pólya urn over the subtype of rules with LHS a,
parameterized by M.pseudo restricted to those rules.
Equations
- M.lhsUrn a = { pseudo := fun (x : G.RulesWithLHS a) => match x with | ⟨r, property⟩ => M.pseudo r, pseudo_pos := ⋯ }
Per-LHS corpus rule-count vector as a function on the subtype.
Equations
- Morphology.FragmentGrammars.DMPCFG.lhsCounts a D ⟨r, property⟩ = CFGTree.corpusRuleCount r D
Per-LHS factor in the eq 3.9 product: the Pólya partition probability of the corpus rule-counts under the per-LHS pseudo-counts.
Γ(Σ π_i^A) / Γ(Σ π_i^A + Σ x_i^A) · ∏_i Γ(π_i^A + x_i^A) / Γ(π_i^A) .
Equations
- M.lhsFactor a D = (M.lhsUrn a).seqProb (Morphology.FragmentGrammars.DMPCFG.lhsCounts a D)
DMPCFG corpus probability — eq 3.9 of @cite{odonnell-2015}. Product
over LHSs that appear in the grammar of the per-LHS Pólya factor.
LHSs with no rules in G do not appear in the index set, so they
contribute trivially; the product ranges over G.rules.image (·.input).
Equations
- M.corpusProb D = ∏ a ∈ Finset.image (fun (x : ContextFreeRule T G.NT) => x.input) G.rules, M.lhsFactor a D
For LHSs that have at least one rule, the subtype is nonempty.
The per-LHS factor is strictly positive when the LHS has rules.
Direct corollary of PolyaUrn.seqProb_pos.
DMPCFG corpus probabilities are nonnegative (in fact, positive).
Per-LHS counts vanish on the empty corpus.
The empty corpus has probability 1 under any DMPCFG: each
per-LHS Pólya factor is seqProb on the all-zero count
vector, which equals 1 by PolyaUrn.seqProb_zero.
Posterior MAP weights #
Per @cite{odonnell-2015} §2.4 (Dirichlet–multinomial conjugacy),
the posterior over rule weights given a corpus is again Dirichlet,
so the per-rule posterior expected value (the "MAP weight", per
the loose CL convention noted above) is
(π_r + x_r) / Σ_{r' with same LHS} (π_{r'} + x_{r'}).
This is the productivity score @cite{odonnell-2015} Ch 7 (p. 268)
identifies as DMPCFG's failure mode: when corpus counts dominate
pseudo-counts, frequency wins over the prior.
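A hedged arithmetic sketch of that failure mode in plain Lean (no Linglib imports; the numbers are invented): prior pseudo-counts π = (10, 1) favor r₁, but corpus counts x = (0, 100) reverse the posterior-mean ranking, because mapWeight compares π_r + x_r over a shared per-LHS denominator.

```lean
-- Shared denominator Σ(π + x) = (10 + 0) + (1 + 100) = 111.
-- mapWeight r₁ = 10/111 < 101/111 = mapWeight r₂: frequency wins
-- over the prior once counts dominate pseudo-counts.
example : ((10 : ℚ) + 0) / 111 < ((1 : ℚ) + 100) / 111 := by norm_num
```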
Per-rule MAP weight: posterior expected value of the rule's
Dirichlet weight given the corpus rule-counts. Equals
(pseudo r + count r D) / Σ_{r' with same LHS} (pseudo r' + count r' D).
Returns 0 by convention when the LHS has no rules in G (vacuous;
in practice r ∈ G.rules makes the denominator positive — see
mapWeight_denom_pos).
Equations
- M.mapWeight r D = (M.pseudo r + ↑(CFGTree.corpusRuleCount r D)) / ∑ r' ∈ G.rules with r'.input = r.input, (M.pseudo r' + ↑(CFGTree.corpusRuleCount r' D))
The denominator of mapWeight is positive whenever the LHS
has at least one rule in G. Each summand pseudo r' + count r' D
is at least pseudo r' > 0 by pseudo_pos.
Takes the per-LHS nonemptiness as a Nonempty typeclass
argument rather than an explicit existential — mathlib idiom
for "this construction needs the bucket to be inhabited."
Consumers either declare instance : Nonempty (RulesWithLHS …) := ⟨⟨r, _⟩⟩
once for their grammar+LHS pair, or haveI the witness locally.
mapWeight is strictly positive for any rule in G. The
numerator is positive (pseudo_pos + nonneg cast) and the
denominator is positive (witness: r itself is in its own LHS
bucket). The Nonempty instance for RulesWithLHS r.input is
synthesised internally from hr.
Corollary: mapWeight is nonnegative for rules in G.
Empty-corpus reduction: mapWeight collapses to the prior
expected value pseudo r / Σ_{r' with same LHS} pseudo r'.
The Dirichlet posterior with no data IS the prior.
Same-LHS comparison (iff form): with a shared LHS, mapWeight
ordering is equivalent to pseudo + count ordering. The shared
denominator cancels. This is the technical core of the
@cite{odonnell-2015} Ch 7 DMPCFG critique: when corpus counts
overcome pseudo-count differences, the more-frequent rule wins
regardless of prior.
Directional form of mapWeight_lt_mapWeight_iff_of_same_lhs for
apply-style proofs.
The probability axiom: mapWeight over rules sharing an LHS
sums to 1. Each summand has the same denominator (the LHS-sum),
and the numerators sum to that same denominator, so the ratio
sums to 1. This is what justifies calling mapWeight a weight
rather than just a number — it is the per-LHS Dirichlet
posterior mass function.
PMF lift #
The per-LHS mapWeights sum to 1 (mapWeight_sum_eq_one_of_lhs),
so they form a probability mass function on the rules sharing that
LHS. mapWeightPMF packages this as mathlib's PMF, connecting
DMPCFG to the standard probability infrastructure: the resulting
object inherits support, bind, map, etc. from PMF, and
sum-to-1 is now a structural property (a field of PMF) rather
than a follow-up theorem.
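A minimal Mathlib sketch of the packaging step (assuming Mathlib's current PMF.ofFintype signature; this is illustrative, not Linglib code): a uniform two-point weight function becomes a PMF once the sum-to-1 obligation is discharged.

```lean
import Mathlib.Probability.ProbabilityMassFunction.Constructions

-- `PMF.ofFintype` takes the weights in ℝ≥0∞ and a proof they sum
-- to 1; normalization becomes a structural field, not a theorem.
noncomputable example : PMF (Fin 2) :=
  PMF.ofFintype (fun _ => 1 / 2) (by
    rw [Fin.sum_univ_two]
    exact ENNReal.add_halves 1)
```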
The per-LHS Dirichlet posterior MAP weights as a probability mass function on the rules sharing that LHS.
Built via PMF.ofFintype over the subtype RulesWithLHS a,
using mapWeight_sum_eq_one_of_lhs for the normalisation
obligation. The PMF's value at ⟨r, hr⟩ is
ENNReal.ofReal (M.mapWeight r D) — see mapWeightPMF_apply.
Equations
- M.mapWeightPMF D = PMF.ofFintype (fun (x : G.RulesWithLHS a) => ENNReal.ofReal (M.mapWeight (↑x) D)) ⋯
The PMF value at a rule equals the mapWeight value, lifted
to ℝ≥0∞. Bridges mapWeightPMF (the structural object) to
mapWeight (the numeric accessor). Holds by rfl via
PMF.ofFintype_apply.
PMF comparison lemma: at a shared LHS, the per-LHS posterior
PMF ordering reduces to pseudo + count ordering. Bundles
mapWeightPMF_apply, mapWeight_lt_mapWeight_iff_of_same_lhs, and
ENNReal.ofReal_lt_ofReal_iff so consumers can rewrite straight from
PMF mass to numerator arithmetic. The shared-LHS and r₂.1 ∈ G.rules
premises both come for free from the subtype proofs r₁.2, r₂.2.
Strict Bayesian mode (the actual MAP) #
mapWeight computes the Dirichlet posterior mean, which is what
@cite{odonnell-2015} and CL practice generally call "MAP" (a loose
convention). The strict Bayesian MAP — the single most likely value
of the rule-weight vector under the posterior — is the mode of
Dir(pseudo + count), with closed form (π_i - 1) / Σ(π_j - 1).
The mode is well-defined only when pseudo + count > 1 for every
rule in the LHS bucket. Otherwise the posterior density goes to ∞ at
a simplex vertex (when some π_i + x_i < 1) or has flat edges
(equality case). For symmetric Dirichlet priors with concentration
α ≤ 1 (the CL norm), the strict mode lives at the boundary; this is
exactly why CL uses the mean instead.
When both estimators are defined, they agree on orderings: rule
with higher pseudo + count has higher mean and higher mode. So
the productivity-ordering theorems we prove for mapWeight apply to
posteriorMode automatically (under its precondition). The mode adds
genuinely new content the mean can't express: vertex behavior,
asymptotic agreement (mode → mean as pseudo + count → ∞), and
boundary collapse.
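A hedged numeric sketch of the mean/mode relationship (symmetric prior α = 1/2, counts x = (3, 1), both invented; every posterior entry π + x exceeds 1, so the mode is defined): the two estimators differ numerically but agree on the ranking.

```lean
-- Posterior params π + x = (7/2, 3/2).
-- Mean for r₁: (7/2) / 5 = 7/10.
-- Mode for r₁: (7/2 - 1) / ((7/2 - 1) + (3/2 - 1)) = (5/2) / 3 = 5/6.
-- Different values, same ranking: both give r₁ more than half the mass.
example : ((7 : ℚ) / 2) / 5 ≠ (5 / 2) / 3 ∧ (1 / 2 : ℚ) < ((7 : ℚ) / 2) / 5 := by
  norm_num
```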
The Dirichlet posterior mode of the rule weight: the single
most likely value of θ_r under Dir(pseudo + count). Closed form
(π_i - 1) / Σ(π_j - 1), valid in the simplex interior when
pseudo r' + count r' D > 1 for every r' in the LHS bucket
(otherwise the mode is at the boundary; see posteriorMode_well_defined).
Returns 0/0 = 0 (Lean convention) outside the well-defined
regime; theorems carry the well-definedness hypothesis.
Equations
- M.posteriorMode r D = (M.pseudo r + ↑(CFGTree.corpusRuleCount r D) - 1) / ∑ r' ∈ G.rules with r'.input = r.input, (M.pseudo r' + ↑(CFGTree.corpusRuleCount r' D) - 1)
The mode-well-definedness condition: every entry of the LHS bucket exceeds 1 in the posterior. Equivalent to "the Dirichlet posterior has its mode strictly in the simplex interior."
Equations
- M.PosteriorModeWellDefined a D = ∀ r ∈ {x ∈ G.rules | x.input = a}, 1 < M.pseudo r + ↑(CFGTree.corpusRuleCount r D)
The mode denominator is positive when the well-definedness condition holds. Each summand exceeds 0 by hypothesis.
posteriorMode is strictly positive on rules in G when the
well-definedness condition holds. The numerator and denominator
are both positive.
Same-LHS comparison (mode version). Mode ordering matches
pseudo + count ordering, just like mapWeight. The shared
denominator cancels. This is the key fact that makes CL's "MAP =
mean" loose convention harmless for ordering claims: both
estimators give the same productivity ranking.
Mean-mode ordering equivalence. Whenever both mapWeight
and posteriorMode are well-defined for the same comparison,
they agree on ordering of two rules sharing an LHS. This is the
formal Lean witness of "the mean-vs-mode distinction is moot for
productivity-ranking claims."
Bridge: DMPCFG → MultinomialPCFG via posterior MAP #
A DMPCFG is a prior over multinomial PCFGs (Dirichlet hyperparameters parameterize a distribution over per-LHS rule-weight vectors). The posterior given a corpus is again Dirichlet (conjugacy); the posterior MAP weights are a single multinomial PCFG.
This is the bridge: collapse a DMPCFG, conditioned on a corpus, into
its MAP point estimate. At the empty corpus (D = 0), the result is
the prior mean — see posteriorMAP_zero.
DMPCFG does not extend MultinomialPCFG: a prior is a different
kind of object than a point in weight space. The two relate via this
function, not via inheritance.
The Dirichlet posterior MAP collapse: given a corpus D, package
DMPCFG's per-LHS posterior MAP weights as a MultinomialPCFG.
Per-LHS PMFs come straight from mapWeightPMF; sum-to-1 is
structural via PMF.
Requires [∀ a, Nonempty (G.RulesWithLHS a)] (carried by
MultinomialPCFG itself): the construction can't deliver a
multinomial PCFG over a grammar with empty LHS buckets.
Equations
- M.posteriorMAP D = { rulePMF := fun (a : G.NT) => M.mapWeightPMF D }
The posterior MAP MultinomialPCFG's per-LHS PMF unfolds to
DMPCFG's mapWeightPMF. Holds by rfl. Not @[simp] —
Lean unfolds eagerly anyway (consumers like ODonnell2015's
bridge demo use direct application without rw), and tagging
simp risks loops in unfamiliar simp contexts.
Conjugacy decomposition: posterior + mode #
The posteriorMAP collapses two distinct Bayesian operations:
- DMPCFG.posterior : DMPCFG G → Multiset _ → DMPCFG G — the Dirichlet-conjugate update: bump pseudo-counts by the corpus rule counts. The posterior of a Dirichlet given multinomial data is again a Dirichlet (conjugacy), so this stays inside the DMPCFG type.
- DMPCFG.mode : DMPCFG G → MultinomialPCFG G — the Dirichlet-mode projection (= posteriorMAP at the empty corpus).
The decomposition posteriorMAP = mode ∘ posterior is captured by
posteriorMAP_eq_mode_posterior. Naming the two primitives separately
makes the Bayesian structure visible in code and exposes incrementality
of the update (posterior_add) as a stand-alone theorem rather than
an unspoken consequence.
The Dirichlet-conjugate posterior update: given a corpus, bump
each rule's Dirichlet pseudo-count by that rule's corpus count.
Stays inside the DMPCFG type — that's the point of conjugacy.
Equations
- M.posterior D = { pseudo := fun (r : ContextFreeRule T G.NT) => M.pseudo r + ↑(CFGTree.corpusRuleCount r D), pseudo_pos := ⋯ }
Posterior at the empty corpus is the prior: no data, no update.
Incrementality. Updating with D₁ + D₂ equals updating
sequentially with D₁ then D₂. Bayesian update is associative
over disjoint data — the actual content of "the prior commutes
with corpus aggregation." Follows from corpusRuleCount_add.
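The arithmetic behind posterior_add, as a hedged sketch: bumping a pseudo-count by aggregated counts equals bumping sequentially, because the cast from ℕ distributes over addition.

```lean
-- Sequential conjugate updates compose additively in the pseudo-counts.
example (π : ℝ) (x₁ x₂ : ℕ) : π + ((x₁ + x₂ : ℕ) : ℝ) = (π + x₁) + x₂ := by
  push_cast
  ring
```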
Mode of the Dirichlet prior in MultinomialPCFG-space:
the per-LHS-normalized pseudo-counts as a MultinomialPCFG.
Equivalently: posteriorMAP at the empty corpus (no data).
Equations
- M.mode = M.posteriorMAP 0
The conjugacy decomposition. posteriorMAP factors as
mode ∘ posterior: first update the prior with data (Dirichlet
conjugacy), then take the mode (Dirichlet-to-MultinomialPCFG
projection). The Bayesian-MAP estimator stops being a primitive
and becomes a derived composition.