@cite{anttila-1997}: Deriving Variation from Grammar #

Formalizes the quantitative variation predictions for Finnish genitive plurals from @cite{anttila-1997}. Anttila's claim: free variation in Finnish, and crucially its statistical biases, is derivable from a single partially ranked OT grammar. Among the total rankings consistent with the partial ranking, the probability of a variant equals the fraction of rankings under which that variant wins.

The grammar #

Anttila stratifies 16 constraints into 5 mutually ranked strata, with the constraints inside each stratum ordered at random (@cite{anttila-1997} eq. (49)–(50), page 21):

Set 1 ≫ Set 2 ≫ Set 3 ≫ Set 4 ≫ Set 5

Set 1: *X̀.X̀ (1 constraint, NoClash)
Set 2: *L̀, *H̀ (2 constraints; secondary-stress *L, *H)
Set 3: *H/I, *Í, *L.L (3 constraints)
Set 4: *H/O, *Ó, *L/A, *H.H, *X.X, *H́ (6 constraints; final constraint is *H́ acute = primary-stressed-heavy, distinct from Set 2's *H̀ grave = secondary-stressed-heavy)
Set 5: 8 lower constraints (irrelevant for the variation cases here)
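
The stratification can be illustrated with a small Python sketch, independent of the Lean substrate. It draws one total ranking by shuffling inside each stratum while keeping the strata in order, and counts the total orders compatible with Sets 1 to 4. The names `STRATA`, `sample_ranking`, and `count_consistent_rankings` are hypothetical and exist only for this illustration; Set 5 is omitted because its members are not listed here.

```python
import random
from math import factorial

# Strata as listed above, highest first. Set 5 is left out of this sketch
# because its individual constraints are not enumerated in this file.
STRATA = [
    ["*X̀.X̀"],                                        # Set 1 (NoClash)
    ["*L̀", "*H̀"],                                    # Set 2
    ["*H/I", "*Í", "*L.L"],                           # Set 3
    ["*H/O", "*Ó", "*L/A", "*H.H", "*X.X", "*H́"],     # Set 4
]


def sample_ranking(strata, rng):
    """Draw one total order: shuffle inside each stratum, keep strata in order."""
    ranking = []
    for stratum in strata:
        block = list(stratum)   # copy so the input is not mutated
        rng.shuffle(block)      # uniform random order inside the stratum
        ranking.extend(block)   # strata are concatenated highest first
    return ranking


def count_consistent_rankings(strata):
    """Number of total orders compatible with the stratification."""
    total = 1
    for stratum in strata:
        total *= factorial(len(stratum))
    return total


ranking = sample_ranking(STRATA, random.Random(0))
print(len(ranking))                       # 12 constraints in Sets 1 to 4
print(count_consistent_rankings(STRATA))  # 1! * 2! * 3! * 6! = 8640
```

Because each stratum is shuffled independently, every compatible total order is equally likely, which is the sampling assumption behind the variant probabilities.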

Substrate consumption #

This file routes through the project's POC (Partially Ordered Constraints) substrate. For each motif, a violation-profile function vp : Input → Variant → Fin n → ℕ determines two constraint sets: relevant (the constraints on which vp assigns the two variants different violation counts) and yesFav (the constraints on which vp favors the chosen variant). pocPredict over discrete n (uniform sampling over all n! total orders) gives the variant probability; picksAt_rate_eq reduces pocPredict to the closed form |Y ∩ D| / |D|, with D = relevant and Y = yesFav, so the n! rankings never need to be enumerated.
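
As a sanity check on the closed form, the following Python sketch compares |Y ∩ D| / |D| against brute-force enumeration of all n! rankings for a binary candidate space. It is not the Lean substrate: the function names are hypothetical stand-ins that only mirror the roles of relevant, yesFav, and pocPredict, and the violation profile in the example is invented for illustration.

```python
from fractions import Fraction
from itertools import permutations


def relevant(vp_chosen, vp_other):
    """D: indices of constraints on which the two variants differ."""
    return {i for i, (c, o) in enumerate(zip(vp_chosen, vp_other)) if c != o}


def yes_fav(vp_chosen, vp_other):
    """Y: indices of constraints that favor the chosen variant (fewer violations)."""
    return {i for i, (c, o) in enumerate(zip(vp_chosen, vp_other)) if c < o}


def closed_form_rate(vp_chosen, vp_other):
    """|Y ∩ D| / |D|, defined only when some constraint distinguishes the variants."""
    d = relevant(vp_chosen, vp_other)
    if not d:
        raise ValueError("no constraint distinguishes the two variants")
    y = yes_fav(vp_chosen, vp_other)
    return Fraction(len(y & d), len(d))


def brute_force_rate(vp_chosen, vp_other):
    """Share of all n! rankings under which the chosen variant wins."""
    n = len(vp_chosen)
    wins = total = 0
    for ranking in permutations(range(n)):   # ranking[0] is the top constraint
        total += 1
        for i in ranking:
            if vp_chosen[i] != vp_other[i]:  # first distinguishing constraint decides
                if vp_chosen[i] < vp_other[i]:
                    wins += 1
                break
    return Fraction(wins, total)


# Invented profile over n = 4 constraints; constraint 2 is irrelevant.
chosen, other = (0, 1, 0, 2), (1, 0, 0, 0)
print(closed_form_rate(chosen, other), brute_force_rate(chosen, other))  # 1/3 1/3
```

The two agree because only the highest-ranked member of D matters, and under uniform sampling each member of D is equally likely to be the highest.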

Two POC instances, one per stratum:

Set 3 (n = 3): motif 3ab only. Input is Unit (single motif).
Set 4 (n = 6): motifs 4ab and 5ab. Input is Set4Motif to distinguish the two motifs' violation profiles.

Note on candidate-feature substrate #

We stipulate violation profiles via vp rather than defining NamedConstraint instances. This matches Anttila's own level of abstraction: the paper works directly with violation profiles (@cite{anttila-1997} page 22: "knowing that the weak variant violates one constraint (*L.L) while the strong variant violates two (*H/I, *Í) gives us the result directly"). True NamedConstraint formalisations would require a Finnish syllable substrate (input forms with stress / weight / sonority features feeding into syllable structure) which doesn't yet exist in linglib.
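
The quoted passage can be turned into a small Python illustration of a stipulated violation profile. The sketch assumes exactly one violation mark per violated constraint (the quotation says only which constraints are violated, not how many times) and evaluates both variants under all six orderings of Set 3. `SET3`, `VIOLATIONS`, and `winner` are hypothetical names used only here.

```python
from itertools import permutations

SET3 = ("*H/I", "*Í", "*L.L")

# Stipulated Set 3 violations for motif 3ab, following the quoted passage.
# Assumption: one violation mark per violated constraint.
VIOLATIONS = {
    "strong": {"*H/I": 1, "*Í": 1, "*L.L": 0},
    "weak":   {"*H/I": 0, "*Í": 0, "*L.L": 1},
}


def winner(ranking):
    """Return the variant preferred by the highest constraint that distinguishes them."""
    for constraint in ranking:
        s = VIOLATIONS["strong"][constraint]
        w = VIOLATIONS["weak"][constraint]
        if s != w:
            return "strong" if s < w else "weak"
    return None  # unreachable for this profile: every Set 3 constraint distinguishes


tally = {"strong": 0, "weak": 0}
for ranking in permutations(SET3):  # 3! = 6 total orders of Set 3
    tally[winner(ranking)] += 1

print(tally)  # {'strong': 2, 'weak': 4}
```

The strong variant wins only in the two orderings with *L.L on top, which gives 2/6 = 1/3 without any reference to syllable structure.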

Predictions formalized #

From @cite{anttila-1997} table 52 (page 22) and table 53 (page 23):

3ab (L.TÍI ∼ L.TI, e.g. naa.pu.rei.den ∼ naa.pu.ri.en): decided in Set 3 (n=3). Strong wins 1/3, weak wins 2/3. Observed: 36.9% / 63.1% (215 / 368 corpus tokens).
4ab (H.TÁA ∼ H.TA, e.g. máa.il.mòi.den ∼ máa.il.mo.jen): decided in Set 4 (n=6) with both variants violating two Set-4 constraints. Each wins 1/2. Observed: 50.5% / 49.5% (46 / 45 corpus tokens).
5ab (H.TÓO ∼ H.TO, e.g. kór.jaa.mòi.den ∼ kór.jaa.mo.jen): decided in Set 4. Strong wins 1/5, weak wins 4/5. Observed: 17.8% / 82.2% (76 / 350 corpus tokens).
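
For reference, the predicted and observed shares of the strong variant can be tabulated with a short Python sketch. It reads each token pair above as strong count / weak count, a reading that reproduces the stated percentages; `ROWS` and `observed_share` are hypothetical names, and nothing here is part of the Lean formalization.

```python
from fractions import Fraction

# (motif, predicted share of the strong variant, strong tokens, weak tokens)
ROWS = [
    ("3ab", Fraction(1, 3), 215, 368),
    ("4ab", Fraction(1, 2), 46, 45),
    ("5ab", Fraction(1, 5), 76, 350),
]


def observed_share(strong, weak):
    """Observed proportion of strong-variant tokens."""
    return strong / (strong + weak)


for motif, predicted, strong, weak in ROWS:
    obs = observed_share(strong, weak)
    gap = obs - float(predicted)
    print(f"{motif}: predicted {float(predicted):.3f}, "
          f"observed {obs:.3f}, gap {gap:+.3f}")
```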

Out of scope #

Categorical motifs 1ab, 2ab, 6ab. Per @cite{anttila-1997} table 52, these are decided by Set 1 (NoClash) and Set 2 (*L̀ / *H̀), which this file doesn't model. The categorical predictions follow from higher-stratum constraints decisively favoring one variant.
NamedConstraint instances for *H/I, *Í, *L.L, etc. — would require a Finnish syllable substrate (see "Note on candidate-feature substrate" above).
Observed-vs-predicted comparison theorems. The paper's table 53 shows a small gap between the predicted probabilities (1/3, 2/3, 1/2, 1/2, 1/5, 4/5) and the observed frequencies. This file treats that gap as empirical noise around the discrete prediction; the paper itself notes "as the quantitative predictions of our model are discrete probabilities (1/2, 1/3, 1/5 etc.) it would be difficult to get any closer" (page 23).

Same closed form as @cite{zuraw-2010}, @cite{coetzee-pater-2011} #

Anttila's Finnish variation, Zuraw's Tagalog factorial typology, and Coetzee & Pater's English t/d-deletion all reduce to the same substrate predictor pocPredict (discrete n) with binary candidate spaces: variant probability = |Y ∩ D| / |D|, where D is the set of constraints that distinguish the two variants and Y is the set that favors the chosen variant. That one predictor serves three phonological domains supports the abstraction; see Phenomena/Phonology/Studies/Zuraw2010.lean and Phenomena/Phonology/Studies/CoetzeePater2011.lean for sister consumers.