[GJ03]: Learning OT Constraint Rankings Using a Maximum Entropy Model #
[GJ03] introduce Maximum Entropy models to constraint-based phonology. Their eq (1) defines the conditional probability of output y given input x as:
Pr(y|x) = exp(Σ wᵢfᵢ(y,x)) / Z(x)
This IS softmax(harmonyScore constraints, 1) — the same softmax from
Core.Agent.RationalAction used throughout linglib for RSA pragmatics.
The phonology–pragmatics connection is structural: both are log-linear
models over weighted features, differing only in what the features measure.
Key contributions formalized here #
MaxEnt = softmax (§1):
gjProbis eq (1) by definition — the conditional probability issoftmax (harmonyScore con w (i, ·)), a constraint setcon : CON (I × O) nweighted byw : Fin n → ℝand softmax-decoded.Log-likelihood (§2): The learning objective is log pseudo-likelihood, which decomposes as
Σⱼ (H(yⱼ,xⱼ) − logSumExp(H(·,xⱼ)))— the harmony of each observed output minus the log-partition function.Concavity (§2): Log-likelihood is concave in each weight (
concavityvialogConditional_concaveOn). The decompositionlog P(y|x;w) = affine(wⱼ) − logSumExpOffset(s,r,wⱼ)makes this affine minus convex = concave, guaranteeing a unique global maximum.Learning gradient = E[feature] (§3): The derivative of log-likelihood w.r.t. each weight equals observed minus expected feature value (
gradientviahasDerivAt_logConditional). At the optimum, observed feature counts equal expected feature counts.Wolof data (§4): The learned weights for a categorical grammar are exponentially separated (
ExponentiallySeparated wolofWeights 1), confirming that MaxEnt reproduces OT for categorical data — the empirical counterpart ofmaxent_ot_limit.
[GJ03] eq (1): the conditional probability of output
o given input i is the softmax of the harmony score over a constraint
set con weighted by w — Pr(o ∣ i) ∝ exp(-harmonyScore con w (i, o)).
Mirrors the per-mapping probability of a classical MaxEnt grammar (a
constraint set + weight vector, no record).
Equations
- GoldwaterJohnson2003.gjProb con w i o = Core.softmax (fun (o' : O) => Constraints.harmonyScore con w (i, o')) o
Instances For
[GJ03] eq (1) is gjProb by definition — both are
softmax(harmonyScore con w, 1).
The same softmax function powers RSA pragmatic reasoning
(Core.Agent.RationalAction): both phonological grammar and
pragmatic inference are log-linear models over weighted features.
Log pseudo-likelihood of training data under a MaxEnt grammar (eq (2)).
logPL(w) = Σⱼ log P_w(yⱼ | xⱼ)
Each term decomposes as H(yⱼ, xⱼ) − logSumExp(H(·, xⱼ), 1) via
softmax_exponential_family.
Equations
- One or more equations did not get rendered due to their size.
Instances For
Regularized objective (eq (3)): log-likelihood minus L2 penalty.
J(w) = logPL(w) − Σᵢ (wᵢ − μᵢ)² / (2σᵢ²)
[GJ03] use μᵢ = 0, σᵢ = σ for all constraints, so the
penalty is Σⱼ wⱼ² / (2σ²) over the grammar's weight vector.
Equations
- GoldwaterJohnson2003.regularizedObjective con w data σ = GoldwaterJohnson2003.logPseudoLikelihood con w data - ∑ j : Fin n, w j ^ 2 / (2 * σ ^ 2)
Instances For
Concavity of log-likelihood (fn. 4): the log conditional likelihood of a single observation is concave in each weight, guaranteeing a unique global maximum.
When all weights except wⱼ are held fixed, log P(y|x;w) decomposes as
(wⱼ · sᵧ + rᵧ) − logSumExpOffset(s, r, wⱼ) where sᵢ = −cⱼ(yᵢ,x)
and rᵢ = Σₖ≠ⱼ wₖ(−cₖ(yᵢ,x)). The first term is affine (hence concave);
the second is convex (logSumExpOffset_convex). Concave − convex = concave.
This is logConditional_concaveOn from Core.Agent.RationalAction.
Learning gradient = observed − expected (§2, §4.2):
∂/∂wⱼ log P(y|x) = sᵧ − Σᵢ softmaxOffset(s,r,wⱼ)ᵢ · sᵢ
where sᵢ = −cⱼ(yᵢ,x) (negated violation count of constraint j on
candidate i) and rᵢ = Σₖ≠ⱼ wₖ(−cₖ(yᵢ,x)) (contribution of other
constraints, constant w.r.t. wⱼ).
At the maximum, this derivative is zero, so sᵧ = E_P[s]: the
observed feature value equals the expected feature value.
This is hasDerivAt_logConditional from Core.Agent.RationalAction.
Wolof tongue-root harmony constraints (§3.1, from Boersma 1999).
The five constraints in ranked order, with learned MaxEnt weights from Table 1 (nσ² ≈ 1,200,000).
Equations
- GoldwaterJohnson2003.wolofWeights 0 = 3389 / 100
- GoldwaterJohnson2003.wolofWeights 1 = 17
- GoldwaterJohnson2003.wolofWeights 2 = 10
- GoldwaterJohnson2003.wolofWeights 3 = 353 / 100
- GoldwaterJohnson2003.wolofWeights 4 = 41 / 100
Instances For
All Wolof weights are positive.
The learned Wolof weights are exponentially separated (M = 1): each weight exceeds the sum of all lower-ranked weights.
This is the empirical counterpart of maxent_ot_limit: for a
categorical grammar (no free variation), MaxEnt learning produces
weights that satisfy ExponentiallySeparated, recovering OT's
strict ranking. The theoretical direction is:
ExponentiallySeparated ⟹ lex_imp_lower_violations ⟹ HG = OT
Here we verify the converse empirically: OT-like data ⟹ learning
produces ExponentiallySeparated weights.