Linglib.Phenomena.Phonology.Studies.FrischPierrehumbertBroe2004

Frisch, Pierrehumbert & Broe (2004) @cite{frisch-pierrehumbert-broe-2004} #

Similarity Avoidance and the OCP. Natural Language & Linguistic Theory 22(1):179–228.

@cite{frisch-pierrehumbert-broe-2004} (FPB) argue that the OCP-Place constraint in Arabic verbal roots is a gradient constraint whose strength is a quantitative function of the similarity between homorganic consonants. The categorical OCP-Place analyses of @cite{mccarthy-1986}, @cite{padgett-1995}, and @cite{mccarthy-1994} all face the same trade-off: dividing consonants into co-occurrence classes either ignores robust within-class variation (broad classes, many exceptions) or fragments the data into ad-hoc sub-classes (narrow classes, many missed generalisations). FPB resolve the trade-off with a single gradient constraint based on the natural-classes similarity metric (eq. 7):

Similarity(a, b) = SharedClasses(a, b) / (SharedClasses(a, b) + NonSharedClasses(a, b))

restricted to natural classes containing a place feature. Identical consonants have similarity 1; non-homorganic consonants have similarity 0. The metric is sensitive to inventory contrast: larger, more divided classes (e.g. coronals) yield lower similarity for any pair within them than smaller classes (e.g. labials), exactly capturing the empirical distance between the strong coronal OCP and the weaker guttural/dorsal OCPs.

What this file formalises #

What this file does not formalise #

Connection to ForbiddenPairs.lean's design boundary #

Linglib/Core/Computability/Subregular/ForbiddenPairs.lean (the substrate file for tier-based strictly 2-local grammars defined by forbidden-pair relations) cites FPB in its design-boundary section as the empirical motivation for "single-tier TSL_2 cannot capture gradient similarity-based OCP." Before this file existed, that citation lived only in docstring prose. Two declarations below make the claim Lean-formal: (1) thresholdedTSL instantiates the substrate's TSLGrammar.ofForbiddenPairs with R := λ x y => similarity xs x y ≥ t, giving a real (not metaphorical) TSL_2 grammar over Arabic; and (2) thresholdedTSL_pair_iff proves that the grammar's accept/reject decision on labial pairs is exactly the binary step function similarity < t. The categorical_fails_three_test_points divergence theorem then witnesses that no such two-valued model can match three specific Table IV bins.

Connection to Hansson2010.lean #

@cite{hansson-2010} (Phenomena/Phonology/Studies/Hansson2010.lean) cites FPB at line 76 in its design-boundary section on similarity-graded transparency: cases where intervening segments behave differently depending on similarity to the harmonising pair cannot be captured by single-tier TSL with a fixed tier predicate. This file is the load-bearing instance of that observation for Arabic OCP-Place specifically.

Why this paper anchors a study file #

Per CLAUDE.md's anchoring discipline: every Lean file is anchored to exactly one of (a) a specific paper, (b) a documented empirical pattern, or (c) a named theoretical framework. The audit on ForbiddenPairs.lean (committed 0.230.508, hash 8579b346) flagged FPB as one of four "silent divergences" — a paper used as a substrate file's docstring example without itself being anchored. This file closes that finding by anchoring FPB to its primary phenomenon (Phonology) with a Lean-formal divergence theorem connecting back to the substrate citation.

§ 1: The Arabic Consonant Inventory (FPB feature matrix (8), p. 201) #

The 28-segment Arabic consonant inventory used by @cite{frisch-pierrehumbert-broe-2004} eq. 8 (p. 201). The labelling follows the paper's IPA-with-superscript-ˁ-for-emphatic convention.

This file's worked examples and divergence theorem use only the labial sub-inventory {b, f, m, w}; the full 28-segment list is included to keep the inventory anchored to FPB's feature matrix and to make the file extensible to future enumeration work over the full inventory.

  • b : Arabic

    /b/ — voiced labial stop.

  • f : Arabic

    /f/ — voiceless labial fricative.

  • m : Arabic

    /m/ — labial nasal.

  • t : Arabic

    /t/ — voiceless coronal stop.

  • d : Arabic

    /d/ — voiced coronal stop.

  • tEmph : Arabic

    /tˁ/ — emphatic voiceless coronal stop.

  • dEmph : Arabic

    /dˁ/ — emphatic voiced coronal stop.

  • theta : Arabic

    /θ/ — voiceless coronal fricative.

  • edh : Arabic

    /ð/ — voiced coronal fricative.

  • s : Arabic

    /s/ — voiceless coronal sibilant.

  • z : Arabic

    /z/ — voiced coronal sibilant.

  • sEmph : Arabic

    /sˁ/ — emphatic voiceless coronal sibilant.

  • zEmph : Arabic

    /zˁ/ — emphatic voiced coronal sibilant.

  • esh : Arabic

    /ʃ/ — voiceless palatoalveolar sibilant.

  • k : Arabic

    /k/ — voiceless dorsal stop.

  • g : Arabic

    /g/ — voiced dorsal stop.

  • q : Arabic

    /q/ — uvular stop (dorsal+pharyngeal in FPB's analysis).

  • chi : Arabic

    /χ/ — voiceless uvular fricative.

  • gamma : Arabic

    /ʁ/ — voiced uvular fricative.

  • hbar : Arabic

    /ħ/ — voiceless pharyngeal fricative.

  • ayin : Arabic

    /ʕ/ — voiced pharyngeal fricative.

  • h : Arabic

    /h/ — voiceless laryngeal fricative.

  • glottal : Arabic

    /ʔ/ — laryngeal stop.

  • l : Arabic

    /l/ — coronal lateral.

  • r : Arabic

    /r/ — coronal rhotic.

  • n : Arabic

    /n/ — coronal nasal.

  • w : Arabic

    /w/ — labial-velar glide.

  • j : Arabic

    /j/ — palatal glide.

      § 2: Labial Natural Classes — Per-Pair Enumerations from FPB p. 199 #

      Per-pair enumerations vs unified lattice #

      @cite{frisch-pierrehumbert-broe-2004} p. 199 enumerates the labial natural classes relevant for two specific worked examples:

      The /f, m/ enumeration lists the singleton class {f} among its non-shared classes (9 classes in total), whereas the /b, f/ enumeration omits it (8 classes in total). This difference is most likely a paper enumeration error, not a principled per-pair-relativised derivation: under @cite{broe-1993}'s specification-theory lattice the relevant labial natural classes are inventory-determined (closed under intersection of feature-class extensions), so once the lattice is fixed the set of classes containing f is fixed too. {f} would therefore appear in both enumerations under any consistent application of Broe's machinery, giving total = 9 for /b, f/ and similarity 3/9 ≈ 0.33 (rather than the paper's 3/8 = 0.375). We reproduce the paper's two enumerations faithfully so that the worked-example similarity values match the paper's exact numbers (2/9 and 3/8); a unified-lattice derivation that would give the systematic 3/9 result requires substrate work (Broe 1993 specification theory) and is deferred.
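The discrepancy between the two enumerations can be checked mechanically. The following is a minimal Python sketch, not the Lean source: it assumes the class lists exactly as quoted in this file's worked examples (FPB p. 199), and the names FM_CLASSES, BF_CLASSES, and classes_containing are illustrative only.

```python
from fractions import Fraction

# Labial natural classes as enumerated for each worked example (FPB p. 199).
# Names are illustrative and do not correspond to the Lean declarations.
FM_CLASSES = [
    frozenset("bfmw"), frozenset("bfm"),                    # shared by /f, m/
    frozenset("bf"), frozenset("fw"), frozenset("f"),       # non-shared
    frozenset("bmw"), frozenset("bm"), frozenset("mw"), frozenset("m"),
]
BF_CLASSES = [
    frozenset("bfmw"), frozenset("bfm"), frozenset("bf"),   # shared by /b, f/
    frozenset("fw"), frozenset("bmw"), frozenset("bm"),     # non-shared
    frozenset("bw"), frozenset("b"),
]

def classes_containing(classes, seg):
    """Set of classes in the enumeration that contain the segment."""
    return {c for c in classes if seg in c}

# Classes containing f that the /f, m/ list has but the /b, f/ list lacks.
missing_from_bf = classes_containing(FM_CLASSES, "f") - classes_containing(BF_CLASSES, "f")

def ratio(classes, a, b):
    """Shared classes over classes containing a or b (assumes a non-empty total)."""
    shared = sum(1 for c in classes if a in c and b in c)
    total = sum(1 for c in classes if a in c or b in c)
    return Fraction(shared, total)

paper_value = ratio(BF_CLASSES, "b", "f")                            # the paper's enumeration
unified_value = ratio(BF_CLASSES + list(missing_from_bf), "b", "f")  # with {f} restored
```

Under these assumptions the only class that differs is {f}, and restoring it moves the /b, f/ value from 3/8 to 3/9.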

      Labial natural classes relevant for the /f, m/ similarity computation, reproducing FPB p. 199 verbatim. The 2 shared classes appear first; the 7 non-shared follow. Each class is annotated with the phonological gloss the paper provides.

        Labial natural classes relevant for the /b, f/ similarity computation, reproducing FPB p. 199 verbatim. The 3 shared classes appear first; the 5 non-shared follow.

          § 3: The Natural-Classes Similarity Metric (FPB eq. 7, p. 198) #

          The Nat-valued count of natural classes in xs containing both a and b. The numerator of FPB eq. 7.

            The Nat-valued count of natural classes in xs containing at least one of a, b — equivalently, shared + non-shared. The denominator of FPB eq. 7.

              FPB eq. 7: the natural-classes similarity metric.

              Similarity(a, b) = |classes containing both| / |classes containing a or b|

              Generic in xs : List (Finset Arabic), the list of relevant natural classes (containing a place feature, per FPB's stipulation on p. 198).

              Identical consonants participating in any class get similarity 1 (every relevant class containing one contains the other). Non-homorganic consonants get similarity 0 (no relevant labial class contains either, so total = 0 → return 0 by convention to avoid 0/0).

                § 3 worked examples (FPB p. 199) #

                /f, m/ share 2 labial natural classes ({b, f, m, w}, {b, f, m}) per FPB p. 199.

                /f, m/ have 9 total relevant classes (2 shared + 7 non-shared) per FPB p. 199.

                /b, f/ share 3 labial natural classes ({b, f, m, w}, {b, f, m}, {b, f}) per FPB p. 199.

                /b, f/ have 8 total relevant classes (3 shared + 5 non-shared) per FPB p. 199.

                FPB worked example (p. 199): similarity(/f, m/) = 2/9.

                /f, m/ share 2 labial natural classes (the labials, labial consonants) and have 7 non-shared (the obstruents {b, f}, the continuants {f, w}, the voiceless continuants {f}, the voiced labials {b, m, w}, the voiced stops {b, m}, the voiced sonorants {m, w}, the nasals {m}). Similarity = 2 / (2 + 7) = 2/9 ≈ 0.22.

                FPB worked example (p. 199): similarity(/b, f/) = 3/8.

                /b, f/ share 3 labial natural classes (the labials, labial consonants, obstruents) and have 5 non-shared ({f, w}, {b, m, w}, {b, m}, {b, w}, {b}). Similarity = 3 / (3 + 5) = 3/8 = 0.375.
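The two worked examples, together with the identity and 0/0 conventions of the metric, can be reproduced with exact rational arithmetic. This is a hedged Python sketch of eq. 7 rather than a transcription of the Lean definitions: the function names mirror the prose (shared count, total count, similarity) but are assumptions, and the class lists are the ones quoted above.

```python
from fractions import Fraction

def shared_count(classes, a, b):
    """Number of classes containing both a and b (numerator of eq. 7)."""
    return sum(1 for c in classes if a in c and b in c)

def total_count(classes, a, b):
    """Number of classes containing a or b (denominator of eq. 7)."""
    return sum(1 for c in classes if a in c or b in c)

def similarity(classes, a, b):
    """Natural-classes similarity; returns 0 when no class is relevant (avoids 0/0)."""
    total = total_count(classes, a, b)
    if total == 0:
        return Fraction(0)
    return Fraction(shared_count(classes, a, b), total)

# Class lists as quoted in the worked examples (FPB p. 199).
FM_CLASSES = [set("bfmw"), set("bfm"), set("bf"), set("fw"), set("f"),
              set("bmw"), set("bm"), set("mw"), set("m")]
BF_CLASSES = [set("bfmw"), set("bfm"), set("bf"), set("fw"),
              set("bmw"), set("bm"), set("bw"), set("b")]

sim_fm = similarity(FM_CLASSES, "f", "m")   # 2 shared out of 9
sim_bf = similarity(BF_CLASSES, "b", "f")   # 3 shared out of 8
sim_ff = similarity(FM_CLASSES, "f", "f")   # identical segments
sim_tk = similarity(FM_CLASSES, "t", "k")   # no labial class contains either
```

Using Fraction keeps the values exact, which matches the rational-valued statements 2/9 and 3/8 rather than their decimal approximations.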

                § 4: Empirical Table IV (FPB p. 203, adjacent pairs) #

                FPB Table IV (p. 203, adjacent column): the gradient O/E pattern as a function of natural-classes similarity. Each entry is (similarity-bin-midpoint, O/E). The monotonic decrease from O/E = 1.22 at similarity 0 (no constraint) down to O/E ≈ 0 at similarity ≥ 0.4 (categorical avoidance of highly similar pairs) embodies the gradient OCP claim.

                  § 5: Cross-Framework Divergence — Categorical Cannot Fit Table IV #

                  The categorical-at-threshold model #

                  Any TSL_2 grammar of the form Core.Computability.Subregular.TSLGrammar.ofForbiddenPairs (fun a b => similarity xs a b ≥ t) p makes a two-valued prediction about each consonant pair: forbidden (membership rejected) if similarity ≥ t, permitted (accepted) otherwise. As an O/E predictor, this becomes a step function: predicted O/E = c1 for similarity < t (where pairs are attested at some baseline rate) and c2 for similarity ≥ t (where pairs are categorically suppressed). Below we show that no choice of t, c1, c2 : ℚ can fit even three specific Table IV bins.

                  Categorical-at-threshold prediction: O/E = c1 if similarity < t, otherwise c2. Any TSL_2 grammar with R := (similarity ≥ t) reduces to this two-valued form.
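The two-valued character of this predictor is easy to see in a small sketch. The Python below is illustrative only: the function name follows the prose, while the parameter values t = 3/10, c1 = 1, c2 = 0 and the similarity grid are hypothetical choices, not values from FPB or from the Lean file.

```python
from fractions import Fraction

def categorical_at_threshold(t, c1, c2, s):
    """Predicted O/E: c1 strictly below the threshold, c2 at or above it."""
    return c1 if s < t else c2

# Hypothetical parameters, chosen only to illustrate the step shape.
t, c1, c2 = Fraction(3, 10), Fraction(1), Fraction(0)

# Similarity grid 0, 1/20, ..., 1.
grid = [Fraction(k, 20) for k in range(21)]
predictions = [categorical_at_threshold(t, c1, c2, s) for s in grid]

# However fine the grid, the predictor never takes more than two values.
distinct_levels = set(predictions)
```

Whatever the parameters, the set of predicted values is a subset of {c1, c2}, which is the property the divergence argument exploits.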

                    theorem Phonology.Studies.FrischPierrehumbertBroe2004.fpb_three_distinct_OE_levels :
                    122 / 100 ≠ 59 / 100 ∧ 122 / 100 ≠ 6 / 100 ∧ 59 / 100 ≠ 6 / 100

                    Three Table IV bins exhibit three pairwise-distinct O/E values: similarity 0 → 1.22, similarity 0.25 → 0.59, similarity 0.55 → 0.06 (see empiricalTableIV above). Witnesses that the FPB pattern has at least three distinct response levels — more than a two-valued model can produce.

                    The substrate connection: a TSL_2 grammar over Arabic whose forbidden-pair relation is "similarity at or above threshold t", restricted to the labial tier. Instantiates Core.Computability.Subregular.TSLGrammar.ofForbiddenPairs directly, making the connection from FPB's similarity metric to the substrate machinery true by construction (rather than docstring-only).

                      Every threshold-induced FPB grammar is TSL_2. Explicit IsTierStrictlyLocal 2 typing of the implicit complexity claim made by the thresholdedTSL constructor — the substrate-classification "any two-valued threshold model is TSL_2" is now a typed theorem rather than docstring prose.

                      BTSL_2 corollary (via IsTierStrictlyLocal.toIsBTSL in Core.Computability.Subregular.Multitier): every threshold-induced FPB grammar's stringset is in the multitier closure of strictly local languages, hence consumed by the @cite{lambert-2026} BTC framework.

                      theorem Phonology.Studies.FrischPierrehumbertBroe2004.thresholdedTSL_pair_iff (xs : List (Finset Arabic)) (t : ℚ) (x y : Arabic) (hx : x.isLabial) (hy : y.isLabial) :
                      [x, y] ∈ (thresholdedTSL xs t).lang ↔ similarity xs x y < t

                      The TSL_2 grammar makes a binary step-function decision on labial pairs. For two labial segments x, y, the grammar thresholdedTSL xs t accepts the bigram [x, y] iff their similarity is strictly below t. This is the precise sense in which any similarity-threshold TSL_2 grammar collapses to the categoricalAtThreshold two-valued prediction: the accept/reject decision is a step function of similarity, not a graded response.
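The accept/reject behaviour described here can be imitated outside Lean. The sketch below is a simplified Python model under stated assumptions: the tier is the labial set {b, f, m, w}, the grammar forbids only adjacent tier pairs (no word-edge constraints), and the similarity lookup contains just the two worked-example values plus identity. The names accepts, SIM, and LABIALS are hypothetical and do not reflect the substrate's actual API.

```python
from fractions import Fraction

LABIALS = {"b", "f", "m", "w"}

# Hypothetical lookup holding only the two worked-example values (FPB p. 199).
SIM = {frozenset("fm"): Fraction(2, 9), frozenset("bf"): Fraction(3, 8)}

def similarity(x, y):
    """Similarity for the pairs covered here; other pairs raise KeyError."""
    if x == y:
        return Fraction(1)
    return SIM[frozenset((x, y))]

def accepts(word, t):
    """Project onto the labial tier, then reject if any adjacent tier pair
    has similarity at or above the threshold t."""
    tier = [seg for seg in word if seg in LABIALS]
    return all(similarity(x, y) < t for x, y in zip(tier, tier[1:]))

quarter = Fraction(1, 4)
fm_ok = accepts(["f", "m"], quarter)            # 2/9 < 1/4, so permitted
bf_ok = accepts(["b", "f"], quarter)            # 3/8 >= 1/4, so forbidden
ftm_ok = accepts(["f", "t", "m"], quarter)      # /t/ is off-tier; same as [f, m]
at_threshold = accepts(["f", "m"], Fraction(2, 9))  # similarity == t is forbidden
```

For a two-segment labial word the result is exactly the comparison similarity < t, which is the step-function behaviour the theorem states.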

                      theorem Phonology.Studies.FrischPierrehumbertBroe2004.categorical_fails_three_test_points (t c1 c2 : ℚ) :
                      ¬(categoricalAtThreshold t c1 c2 0 = 122 / 100 ∧ categoricalAtThreshold t c1 c2 (25 / 100) = 59 / 100 ∧ categoricalAtThreshold t c1 c2 (55 / 100) = 6 / 100)

                      Cross-framework divergence theorem: for ANY threshold t : ℚ and any pair of predicted O/E values c1, c2 : ℚ, the categorical-at-threshold model cannot match all three of FPB Table IV's data points (sim=0, O/E=1.22), (sim=0.25, O/E=0.59), (sim=0.55, O/E=0.06).

                      Significance: any similarity-threshold TSL_2 grammar (per thresholdedTSL, instantiating @cite{heinz-rawal-tanner-2011}'s substrate) makes a binary step-function decision on each pair (per thresholdedTSL_pair_iff), so its O/E prediction collapses to two values: one for permitted (similarity < t) pairs, one for forbidden (similarity ≥ t) pairs. This theorem proves no such two-valued model can reproduce three pairwise-distinct Table IV bins exactly. It is a necessary consequence of FPB's gradient claim, not a full R² comparison: FPB's actual argument is about aggregate fit quality across 9 bins (Categorical R² = 0.70 vs Natural Classes R² = 0.75; FPB Table V p. 207) and requires the lexical corpus to formalise. The 3-bin separation captured here is the corpus-free Lean-formal version.

                      The natural downstream extension of FPB's gradient observation is the weighted-constraint MaxEnt phonotactic learner of @cite{hayes-wilson-2008}, which uses similarity-relevant features as primitives and reproduces gradient phonotactic patterns by fitting log-linear weights — a positive-fit complement to this file's negative-fit (categorical-fails) result.

                      Proof strategy: case analysis on which of the four intervals delimited by 0, 0.25, and 0.55 contains the threshold t. In each case at least two of the three test points receive the same constant (c1 or c2), so their required O/E values would have to be equal; linarith discharges the resulting numerical impossibility because the three required O/E values are pairwise distinct.
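The case analysis can be sanity-checked numerically. The Python sketch below is not a proof and does not replace the Lean theorem: it assumes the step-function definition given above (c1 when similarity < t, otherwise c2), takes one representative threshold per interval plus the three boundary values, and checks that some constant is always required to equal two different O/E values. The helper names are illustrative.

```python
from fractions import Fraction

# The three Table IV test points used by the theorem: similarity -> required O/E.
TARGETS = {
    Fraction(0): Fraction(122, 100),
    Fraction(25, 100): Fraction(59, 100),
    Fraction(55, 100): Fraction(6, 100),
}

def constant_used(t, s):
    """Which constant the step function returns at similarity s."""
    return "c1" if s < t else "c2"

def fit_is_impossible(t):
    """True if some constant would have to equal two different O/E values."""
    required = {}
    for s, oe in TARGETS.items():
        required.setdefault(constant_used(t, s), set()).add(oe)
    return any(len(values) > 1 for values in required.values())

# One representative per interval: t <= 0, 0 < t <= 0.25, 0.25 < t <= 0.55, t > 0.55.
REPRESENTATIVES = [Fraction(-1), Fraction(1, 10), Fraction(2, 5), Fraction(1)]
BOUNDARIES = [Fraction(0), Fraction(25, 100), Fraction(55, 100)]

results = [fit_is_impossible(t) for t in REPRESENTATIVES + BOUNDARIES]
```

Because constant_used depends on t only through which interval contains it, the representatives cover every pattern the step function can produce on these three points.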