
Linglib.Phenomena.Phonology.Studies.Flemming2021

@cite{flemming-2021}: Comparing MaxEnt and Noisy Harmonic Grammar

@cite{flemming-2021}

@cite{flemming-2021} compares three stochastic Harmonic Grammar variants — MaxEnt, Noisy HG (NHG), and Normal MaxEnt — identifying logit uniformity as the diagnostic that distinguishes them.

The three models as Random Utility Models

All three HG variants are Random Utility Models (RUMs) differing only in the noise distribution added to the deterministic harmony scores:

| Model | Noise target | Distribution | Binary P | Reference |
|---|---|---|---|---|
| MaxEnt | candidates | Gumbel | logistic(H−H') | maxent_eq_gumbelRUM |
| NHG | weights | Gaussian | Φ((H−H')/σ_d) | nhg_choiceProb_eq |
| Normal MaxEnt | candidates | Gaussian | Φ((H−H')/(ε√2)) | normalMaxEnt_choiceProb_eq |
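The three binary choice probabilities in the table can be compared numerically. The following is a minimal Python sketch, not part of Linglib: the function names, the harmony difference `dh`, and the violation difference vector `viol_diff` are hypothetical values chosen for illustration, and σ = ε = 1 is assumed.

```python
import math

def logistic(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(x):
    """Standard normal CDF Φ(x), written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def maxent_p(dh):
    """MaxEnt: logistic of the harmony difference H − H'."""
    return logistic(dh)

def nhg_p(dh, viol_diff, sigma=1.0):
    """NHG: Φ((H − H') / σ_d) with σ_d = σ · sqrt(Σ (c_j(a) − c_j(b))²)."""
    sigma_d = sigma * math.sqrt(sum(d * d for d in viol_diff))
    return normal_cdf(dh / sigma_d)

def normal_maxent_p(dh, eps=1.0):
    """Normal MaxEnt: Φ((H − H') / (ε√2))."""
    return normal_cdf(dh / (eps * math.sqrt(2.0)))

# Hypothetical inputs, not taken from the paper.
dh = 1.0                    # harmony difference H − H'
viol_diff = [1, 0, -1, 1]   # violation differences, squared sum 3

print("MaxEnt       ", maxent_p(dh))
print("NHG          ", nhg_p(dh, viol_diff))
print("Normal MaxEnt", normal_maxent_p(dh))
```

Only the NHG function takes the violation difference profile as an argument. That dependence is what the diagnostic discussed on this page turns on.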

Key diagnostic: logit uniformity

MaxEnt exhibits logit uniformity (eq (10)): adding one violation of constraint j changes the logit by exactly −wⱼ, regardless of the tableau context. This follows from the log-odds identity (logit_uniformity):

log(P(a)/P(b)) = H(a) − H(b)

NHG violates logit uniformity because its noise standard deviation σ_d = σ · √(Σ(cⱼ(a)−cⱼ(b))²) (nhgSigmaD) depends on the violation difference profile. The same harmony difference ΔH produces different probits ΔH/σ_d in different contexts.

Normal MaxEnt has probit uniformity (constant σ_d = ε√2) rather than logit uniformity, leading to probit (Φ) rather than logistic probability functions — an empirically distinguishable prediction.

French schwa data

Flemming tests logit uniformity on French schwa deletion across 8 phonological contexts with 6 constraints (Table (35)). Contexts that share the same *Clash violation difference should show the same logit difference under MaxEnt. We encode these data and verify this prediction in the declarations that follow.

theorem Flemming2021.maxent_eq_gumbelRUM {C : Type} [Fintype C] [Nonempty C] (constraints : List (Core.Constraint.WeightedConstraint C)) (c : C) :

MaxEnt = Gumbel RUM (@cite{flemming-2021} §4/§10): MaxEnt probability is exactly the McFadden integral with Gumbel scale β = 1.

This formalizes the RUM connection: MaxEnt adds i.i.d. Gumbel noise to candidate harmonies, and by McFadden's theorem (mcfaddenIntegral_eq_softmax), the resulting choice probability is softmax — i.e., the standard MaxEnt formula.
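The Gumbel RUM connection can be checked by simulation. The sketch below is an illustration in Python, not the Lean proof: it adds i.i.d. standard Gumbel noise to a set of hypothetical candidate harmonies, picks the noisy maximum, and compares the resulting choice frequencies with softmax. The harmony values, sample size, and seed are assumptions.

```python
import math
import random

def softmax(harmonies):
    """MaxEnt probabilities exp(H_i) / Σ exp(H_k), shifted by the max for stability."""
    m = max(harmonies)
    exps = [math.exp(h - m) for h in harmonies]
    z = sum(exps)
    return [e / z for e in exps]

def standard_gumbel(rng):
    """Draw one standard Gumbel variate by inverse CDF: -log(-log(U))."""
    u = rng.random()
    while u == 0.0:          # avoid log(0); random() lies in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_rum_frequencies(harmonies, n_samples, seed=0):
    """Frequency with which each candidate has the highest noisy harmony."""
    rng = random.Random(seed)
    wins = [0] * len(harmonies)
    for _ in range(n_samples):
        noisy = [h + standard_gumbel(rng) for h in harmonies]
        wins[noisy.index(max(noisy))] += 1
    return [w / n_samples for w in wins]

harmonies = [-1.0, -2.0, -3.5]   # hypothetical candidate harmonies
exact = softmax(harmonies)
approx = gumbel_rum_frequencies(harmonies, 50_000)
for e, a in zip(exact, approx):
    print(f"softmax {e:.4f}   simulated {a:.4f}")
```

The simulated frequencies should agree with softmax up to sampling error. The theorem states the exact identity that the simulation only approximates.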

theorem Flemming2021.eq10_logit_harmony {C : Type} [Fintype C] [Nonempty C] (constraints : List (Core.Constraint.WeightedConstraint C)) (a b : C) :

Flemming's eq (10): logit(P_a) = h_a − h_b. The MaxEnt logit-harmony identity. Alias for maxent_logit_harmony.

theorem Flemming2021.iia {C : Type} [Fintype C] [Nonempty C] (constraints : List (Core.Constraint.WeightedConstraint C)) (a b : C) :

MaxEnt ratio independence (IIA): P(a)/P(b) = exp(H(a) − H(b)). The probability ratio depends only on the candidates' own scores, not on any other candidates. Corollary of softmax_odds with α = 1.

MaxEnt binary logistic (@cite{flemming-2021} eq (9)/(11)): with two candidates, MaxEnt probability is the logistic function of the harmony difference.

P(0) = 1 / (1 + e^{-(H(0) − H(1))}) = logistic(H(0) − H(1))

Corollary of softmax_binary with α = 1.
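Both identities above are easy to confirm numerically. This is a small Python sketch with hypothetical harmony values: it checks that the ratio P(a)/P(b) equals exp(H(a) − H(b)) whether or not a third candidate is present, and that the two-candidate case reduces to the logistic function.

```python
import math

def softmax(harmonies):
    """MaxEnt probabilities exp(H_i) / Σ exp(H_k)."""
    m = max(harmonies)
    exps = [math.exp(h - m) for h in harmonies]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical harmonies for candidates a and b, and for an extra candidate c.
h_a, h_b, h_c = -1.0, -2.5, -0.3

two = softmax([h_a, h_b])
three = softmax([h_a, h_b, h_c])

ratio_two = two[0] / two[1]          # P(a)/P(b) with two candidates
ratio_three = three[0] / three[1]    # P(a)/P(b) after adding c
expected_ratio = math.exp(h_a - h_b)

binary_logistic = 1.0 / (1.0 + math.exp(-(h_a - h_b)))

print(ratio_two, ratio_three, expected_ratio)
print(two[0], binary_logistic)
```

Adding candidate c lowers both P(a) and P(b), but it scales them by the same factor, so their ratio is unchanged.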

def Flemming2021.schwaDiff (ctx : Fin 8) (con : Fin 6) :

Violation difference matrix: ə candidate minus ∅ candidate. Rows = 8 contexts, columns = 6 constraints. Constraint order: 0=NoSchwa, 1=*CCC, 2=*Clash, 3=Max, 4=Dep, 5=*Cluster. Table (35) from @cite{flemming-2021}, data from @cite{smith-pater-2020}.

def Flemming2021.clashPairs :
Fin 4 → Fin 8 × Fin 8

The four *Clash pairs: contexts that differ only in *Clash (index 2). Each pair is (without *Clash, with *Clash).

theorem Flemming2021.clash_pairs_identical_except_clash (pair : Fin 4) (j : Fin 6) (hj : j ≠ 2) :
schwaDiff (clashPairs pair).1 j = schwaDiff (clashPairs pair).2 j

*Clash pairs differ only in the *Clash column (index 2): for each pair, all non-*Clash violations are identical.

theorem Flemming2021.clash_diff_is_one (pair : Fin 4) :
schwaDiff (clashPairs pair).2 2 - schwaDiff (clashPairs pair).1 2 = 1

The *Clash violation difference is exactly 1 for all pairs.

theorem Flemming2021.logit_uniformity_clash (w : Fin 6) (pair : Fin 4) :
∑ j : Fin 6, w j * (schwaDiff (clashPairs pair).2 j) - ∑ j : Fin 6, w j * (schwaDiff (clashPairs pair).1 j) = w 2

Logit uniformity for *Clash (@cite{flemming-2021} §7.1): the *Clash contribution to the harmony difference is the same across all four paired contexts.

For any weights w, the harmony difference change between paired contexts = −w₂ (*Clash weight), independent of context. This follows from clash_pairs_identical_except_clash: since non-*Clash violations are identical in each pair, their weighted contributions cancel, leaving only −w₂ · 1 = −w₂.

This is a special case of me_predicts_hz (Separability.lean): the *Clash violation differences are column-insensitive (constant across paired contexts), so the weighted sum satisfies the constant-difference identity.
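The cancellation argument can be illustrated with a small Python sketch. The rows below are hypothetical violation differences, not the values of Table (35), since the schwaDiff matrix is not rendered on this page. They are built so that each pair differs only in the *Clash column (index 2), by exactly 1, which is what clash_pairs_identical_except_clash and clash_diff_is_one assert. The weights are also hypothetical.

```python
CLASH = 2  # column index of *Clash in the constraint order

# Hypothetical "without *Clash" rows (6 constraints each); NOT Table (35).
base_rows = [
    [1, 0, 0, -1, 0, 1],
    [1, 1, 0, -1, 0, 0],
    [1, 0, 0, 0, -1, 1],
    [1, 1, 0, 0, -1, 0],
]

def add_clash(row):
    """Return a copy of the row with one extra *Clash violation difference."""
    out = list(row)
    out[CLASH] += 1
    return out

# Each pair is (without *Clash, with *Clash).
pairs = [(row, add_clash(row)) for row in base_rows]

def weighted_sum(weights, row):
    """Σ_j w_j · diff_j for one context."""
    return sum(w * d for w, d in zip(weights, row))

weights = [2.0, 1.5, 0.75, 3.0, 1.0, 0.5]  # hypothetical constraint weights

changes = [
    weighted_sum(weights, with_clash) - weighted_sum(weights, without_clash)
    for without_clash, with_clash in pairs
]
print(changes, weights[CLASH])
```

Every entry of `changes` equals the *Clash weight, whatever the other columns contain. This matches the weighted-sum difference stated in the theorem; the corresponding harmony change carries the opposite sign.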

def Flemming2021.observedP :
Fin 8

Observed probability of schwa realization across 8 contexts. Data from @cite{smith-pater-2020} (Table 2 of @cite{flemming-2021}).

Values are approximate proportions (hundredths). The key pattern: within each *Clash pair, the +*Clash context always has higher P(schwa), consistent with the *Clash constraint favoring schwa insertion.


Adding a *Clash violation increases P(schwa) in every paired context.

def Flemming2021.schwaSqSum (ctx : Fin 8) :

Sum of squared violation differences for a context.

This is the study-local analogue of violationDiffSqSumQ from NoisyHG.lean: both compute Σⱼ (cⱼ(ə) − cⱼ(∅))², but schwaSqSum operates on the pre-computed difference matrix schwaDiff (Table (35)) rather than a WeightedConstraint list.

theorem Flemming2021.nhg_sigmaD_sq_varies :
schwaSqSum 0 = 3 ∧ schwaSqSum 1 = 4 ∧ schwaSqSum 2 = 3 ∧ schwaSqSum 3 = 4 ∧ schwaSqSum 4 = 3 ∧ schwaSqSum 5 = 4 ∧ schwaSqSum 6 = 3 ∧ schwaSqSum 7 = 4

NHG noise variance σ_d² is context-dependent: without *Clash, the squared violation sum is 3; with *Clash, it is 4. The same *Clash violation change produces different σ_d values in different tableaux: σ_d = √3 vs σ_d = 2 (Table 3 of @cite{flemming-2021}).
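To see what the varying σ_d does to the probit, the squared sums from nhg_sigmaD_sq_varies can be plugged into σ_d = σ · √(squared sum). The Python sketch below assumes σ = 1 and uses a hypothetical harmony difference `dh`; only the list of squared sums comes from the theorem above.

```python
import math

def nhg_sigma_d(sq_sum, sigma=1.0):
    """σ_d = σ · sqrt(Σ_j (c_j(ə) − c_j(∅))²)."""
    return sigma * math.sqrt(sq_sum)

# Squared violation sums for contexts 0..7, as stated in nhg_sigmaD_sq_varies.
sq_sums = [3, 4, 3, 4, 3, 4, 3, 4]
sigma_ds = [nhg_sigma_d(s) for s in sq_sums]

dh = 1.0  # hypothetical harmony difference, held fixed across contexts

# Same harmony difference, different probit, because σ_d differs.
probit_without_clash = dh / sigma_ds[0]   # σ_d = √3
probit_with_clash = dh / sigma_ds[1]      # σ_d = 2

print(sigma_ds)
print(probit_without_clash, probit_with_clash)
```

Holding the harmony difference fixed isolates the effect of the denominator: the larger σ_d in the +*Clash context pulls the probit toward zero.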

noncomputable def Flemming2021.nhgProbitChange (h_init Δh σ_d σ_d' : ) :

NHG probit change when moving from one context to another: the change in the probit Φ⁻¹(P) = Δh / σ_d when σ_d changes.

h_init = initial harmony difference, Δh = harmony change (e.g., −w_Clash), σ_d / σ_d' = noise s.d. before/after the change.

theorem Flemming2021.nhg_probit_change_depends_on_h_init (Δh σ_d σ_d' h₁ h₂ : ) ( : σ_d ≠ σ_d') (hσ_pos : 0 < σ_d) (hσ'_pos : 0 < σ_d') (hh : h₁ ≠ h₂) :
nhgProbitChange h₁ Δh σ_d σ_d' ≠ nhgProbitChange h₂ Δh σ_d σ_d'

Probit non-uniformity (@cite{flemming-2021} §7.2): when σ_d ≠ σ_d', the NHG probit change depends on the initial harmony difference h_init.

Two contexts with different initial harmonies h₁ ≠ h₂ but the same *Clash change Δh produce different probit changes. This is because the denominator shift (σ_d → σ_d') rescales the existing harmony difference differently depending on its magnitude.

Concretely, for French schwa with σ = 1 (@cite{flemming-2021} §7.2): adding a *Clash violation changes σ_d from √3 to 2 in all pairs, but the initial harmony difference h_ə − h_∅ differs between pairs (e.g., −2.2 for pair (0,1) vs 0.01 for pair (4,5)), so the probit changes differ despite the same *Clash change.

theorem Flemming2021.nhgProbitChange_decomp (h_init Δh σ_d σ_d' : ) (hσ_pos : 0 < σ_d) (hσ'_pos : 0 < σ_d') :
nhgProbitChange h_init Δh σ_d σ_d' = h_init * (σ_d - σ_d') / (σ_d * σ_d') + Δh / σ_d'

Probit change decomposition (@cite{flemming-2021} eq (38b)): the NHG probit change decomposes into a context-dependent term (proportional to initial harmony difference) and a uniform term.

Δprobit = h_init · (σ_d − σ_d') / (σ_d · σ_d') + Δh / σ_d'

The first term is why NHG violates probit uniformity: it depends on h_init, which varies across contexts.
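The decomposition can be checked numerically. The definition of nhgProbitChange is not rendered on this page, so the Python sketch below assumes it is (h_init + Δh)/σ_d' − h_init/σ_d, the form that is algebraically equal to the right-hand side of the theorem. The initial harmony differences −2.2 and 0.01 and the shift of σ_d from √3 to 2 are the values quoted above; the harmony change `dh` is a hypothetical value.

```python
import math

def nhg_probit_change(h_init, dh, sigma_d, sigma_d_new):
    """Assumed definition: probit after the change minus probit before it."""
    return (h_init + dh) / sigma_d_new - h_init / sigma_d

def decomposition(h_init, dh, sigma_d, sigma_d_new):
    """Right-hand side of nhgProbitChange_decomp."""
    context_term = h_init * (sigma_d - sigma_d_new) / (sigma_d * sigma_d_new)
    uniform_term = dh / sigma_d_new
    return context_term + uniform_term

sigma_d, sigma_d_new = math.sqrt(3.0), 2.0   # σ_d before/after adding *Clash (σ = 1)
dh = -1.0                                    # hypothetical harmony change
h_values = [-2.2, 0.01]                      # initial differences quoted for pairs (0,1) and (4,5)

changes = [nhg_probit_change(h, dh, sigma_d, sigma_d_new) for h in h_values]
decomposed = [decomposition(h, dh, sigma_d, sigma_d_new) for h in h_values]

for h, c, d in zip(h_values, changes, decomposed):
    print(f"h_init={h:+.2f}  change={c:+.4f}  decomposition={d:+.4f}")
```

The two forms agree for each h_init, and the two contexts get different probit changes even though Δh and both σ_d values are shared. Setting σ_d equal to σ_d' makes the first term vanish and restores uniformity.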


In MaxEnt, equal harmony implies equal probability: since softmax(s, α, b) = exp(α·s(b)) / Σ exp(α·s(i)), candidates with the same score get the same numerator and hence the same probability.

This is the MaxEnt half of the §9 contrast: MaxEnt assigns P(b) = P(c) (both have H = −16), while NHG assigns P(b) ≠ P(c) because their noise variances differ (table45_nhg_variance_differs).
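A short Python sketch of the MaxEnt half: with α = 1, two candidates that share a harmony receive the same softmax probability regardless of the third candidate's score. The shared value H = −16 for b and c is taken from the text above; the harmony assumed for candidate a is hypothetical.

```python
import math

def softmax(scores, alpha=1.0):
    """softmax(s, α, i) = exp(α·s(i)) / Σ_k exp(α·s(k))."""
    m = max(scores)
    exps = [math.exp(alpha * (s - m)) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

h_a = -15.0          # hypothetical harmony for candidate a
h_b = h_c = -16.0    # equal harmonies for b and c, as stated above

p_a, p_b, p_c = softmax([h_a, h_b, h_c])
print(p_a, p_b, p_c)
```

The equality of `p_b` and `p_c` does not depend on the value chosen for `h_a`, since both numerators are the same exponential.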

NHG noise covariance value: Cov(ε_b−ε_a, ε_c−ε_a) = 3σ².

The paper (@cite{flemming-2021} §9, p. 37) computes Cov(ε_a−ε_b, ε_c−ε_b) = 2σ² using candidate b as reference. Our formalization uses candidate a as reference, giving 3σ², a different but equally valid demonstration that the covariance matrix is non-diagonal.

NHG noise covariance is non-zero: Cov(ε_b−ε_a, ε_c−ε_a) ≠ 0. The multivariate normal over score differences has a non-diagonal covariance matrix, so binary comparisons don't determine the joint distribution: NHG violates IIA (@cite{flemming-2021} §9).
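The dependence on the reference candidate can be sketched in Python. The sketch assumes i.i.d. N(0, σ²) noise on each constraint weight, so that Cov(ε_x−ε_r, ε_y−ε_r) = σ² · Σⱼ (cⱼ(x)−cⱼ(r))(cⱼ(y)−cⱼ(r)). The violation profiles below are hypothetical: they are chosen so that the two reference choices give 3σ² and 2σ², and they are not the profiles of Flemming's table (45).

```python
def diff(x, ref):
    """Componentwise violation difference c_j(x) − c_j(ref)."""
    return [xi - ri for xi, ri in zip(x, ref)]

def nhg_cov(x, y, ref, sigma=1.0):
    """Covariance of the noise differences ε_x − ε_ref and ε_y − ε_ref under weight noise."""
    dx, dy = diff(x, ref), diff(y, ref)
    return sigma ** 2 * sum(p * q for p, q in zip(dx, dy))

# Hypothetical violation profiles over three constraints; NOT table (45).
a = [0, 0, 0]
b = [2, 1, 0]
c = [1, 1, 1]

cov_ref_a = nhg_cov(b, c, ref=a)   # Cov(ε_b − ε_a, ε_c − ε_a)
cov_ref_b = nhg_cov(a, c, ref=b)   # Cov(ε_a − ε_b, ε_c − ε_b)

print(cov_ref_a, cov_ref_b)
```

Both values are non-zero, so under either reference the off-diagonal entry of the covariance matrix is non-zero, which is the property the IIA argument relies on.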

The /∅/ square: contexts 0–3 (underlying /∅/, varying onset × stress).

The /ə/ square: contexts 4–7 (underlying /ə/, varying onset × stress).

Violation differences satisfy independence on the /∅/ square: each of the 6 constraints is insensitive to either onset (row) or stress (column).

Violation differences satisfy independence on the /ə/ square.

theorem Flemming2021.schwaNull_hz (w : Fin 6) :
Core.Constraint.ConstantLogitDiff (fun (ctx : Fin 8) => ∑ k : Fin 6, w k * (schwaDiff ctx k)) schwaSquareNull

HZ's generalization for French schwa (/∅/ square): for any MaxEnt weights, the logit-rate difference across onset types is constant across stress contexts. Derived from me_predicts_hz + schwaNull_independence.

theorem Flemming2021.schwaSchwa_hz (w : Fin 6) :
Core.Constraint.ConstantLogitDiff (fun (ctx : Fin 8) => ∑ k : Fin 6, w k * (schwaDiff ctx k)) schwaSquareSchwa

HZ's generalization for French schwa (/ə/ square).

Flemming's three-candidate table-(45) MaxEnt model is a ConstraintSystem Cand3 ℝ decoded by softmaxDecoder 1. The key observation H(b) = H(c) ⟹ predict b = predict c is a property of any MaxEnt-style decoder: it depends only on the score equality and the symmetry of softmax. By contrast NHG and Normal MaxEnt distinguish b and c despite equal harmony (table45_nhg_variance_differs, table45_nhg_covariance_nonzero): same score, different decoder, different predict.

The MaxEnt system assigns equal probability to candidates b and c, matching table45_maxent_equal_prob but stated through the generic predict API. The frame here makes the framework distinction sharp: NHG (argmax after Gaussian noise on weights) and Normal MaxEnt (argmax after Gaussian noise on candidates) would assign different probabilities to b and c despite identical harmonies because they differ only in decoder, not in score.

The MaxEnt system on Cand3 is a probability distribution.