Linglib.Studies.AckermanMalouf2013

[AM13a]: The Low Conditional Entropy Conjecture

[AM13a] [CMC10]

E-complexity vs. I-complexity

Languages differ dramatically in their enumerative complexity (E-complexity): how many inflection classes, allomorphic variants, and paradigm cells they have. But this apparent complexity is misleading. The key question is integrative complexity (I-complexity): given that a speaker knows some forms of a lexeme, how hard is it to predict the rest?

The LCEC

The Low Conditional Entropy Conjecture states that the average conditional entropy of paradigm cells — how uncertain you are about one cell given another — is uniformly low across typologically diverse languages, regardless of E-complexity. Formally:

I-complexity(L) = (1 / (n(n−1))) · Σᵢ≠ⱼ H(Cᵢ | Cⱼ)

is low for all natural languages L, where Cᵢ ranges over paradigm cells and H(Cᵢ | Cⱼ) is the conditional entropy of cell i given cell j.
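
As a worked illustration of the averaging in this formula, the following sketch computes the mean over all ordered off-diagonal pairs for a three-cell system. The names `condEntropy` and `avgOffDiagonal` and all table values are hypothetical, chosen only to make the arithmetic easy to follow. It uses `Float` for quick evaluation, not the library's `ℝ`-valued definition, and assumes n ≥ 2.

```lean
/-- Hypothetical table of conditional entropies in bits for three cells:
`condEntropy i j` stands in for H(Cᵢ | Cⱼ). All values are invented. -/
def condEntropy : Nat → Nat → Float
  | 0, 1 => 0.5
  | 0, 2 => 0.25
  | 1, 0 => 0.0
  | 1, 2 => 0.75
  | 2, 0 => 0.5
  | 2, 1 => 1.0
  | _, _ => 0.0

/-- Average of `h i j` over all ordered pairs with `i ≠ j`, for `n ≥ 2` cells. -/
def avgOffDiagonal (n : Nat) (h : Nat → Nat → Float) : Float :=
  let total : Float :=
    (List.range n).foldl
      (fun acc i =>
        (List.range n).foldl
          -- skip the diagonal: a cell is never conditioned on itself
          (fun acc' j => if i == j then acc' else acc' + h i j)
          acc)
      (0.0 : Float)
  total / (n * (n - 1)).toFloat

-- (0.5 + 0.25 + 0.0 + 0.75 + 0.5 + 1.0) / (3 · 2) = 0.5
#eval avgOffDiagonal 3 condEntropy
```

The sum runs over ordered pairs, so H(Cᵢ | Cⱼ) and H(Cⱼ | Cᵢ) are counted separately; conditional entropy is not symmetric in general.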

Structure

§0: I-complexity (paper-specific aggregation; the substrate declarations InflectionClass, ParadigmSystem, cellEntropy, and conditionalCellEntropy live in Morphology/Paradigm.lean, hoisted there in 0.230.X for shared use with [RHF26]'s informational fusion)
§1: Per-language LCEC verification (all 10 languages)
§2: E-complexity / I-complexity dissociation
§3: Mazatec case study (observed vs. random baseline)

noncomputable def Morphology.WP.iComplexity {n : ℕ} (ps : ParadigmSystem n String) :

ℝ

[AM13a]'s integrative complexity: average conditional cell entropy across all off-diagonal cell pairs.

iComplexity(L) = (1 / (n(n−1))) · Σᵢ≠ⱼ H(Cᵢ | Cⱼ)

Instantiated at Form := String since A&M's paradigms are over natural-language surface forms.

Equations

One or more equations did not get rendered due to their size.

def Morphology.WP.LCECHolds {n : ℕ} (ps : ParadigmSystem n String) (threshold : ℝ) :

Prop

The Low Conditional Entropy Conjecture: I-complexity is bounded by a small threshold. The threshold is empirical (typically ≤ 1 nat).

Equations

Morphology.WP.LCECHolds ps threshold = (Morphology.WP.iComplexity ps ≤ threshold)

theorem Morphology.WP.transparent_iComplexity_zero {n : ℕ} (ps : ParadigmSystem n String) (h : ps.isTransparent) :

iComplexity ps = 0
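
Two consequences follow directly from the shape of `LCECHolds` (a score bounded by a threshold): the bound survives raising the threshold, and a system with I-complexity 0 meets every nonnegative threshold. The sketch below shows both on a stand-in predicate. `Sketch.BoundedBy` and the lemma names are hypothetical and are not declarations from the library; the import of all of Mathlib is an assumption made for convenience.

```lean
import Mathlib

namespace Sketch

/-- Stand-in for the shape of `LCECHolds`: a real-valued score bounded by a threshold. -/
def BoundedBy (score threshold : ℝ) : Prop := score ≤ threshold

/-- Raising the threshold preserves the bound. -/
theorem BoundedBy.mono {score s t : ℝ} (h : BoundedBy score s) (hst : s ≤ t) :
    BoundedBy score t := by
  unfold BoundedBy at h ⊢
  exact le_trans h hst

/-- A score of zero is bounded by every nonnegative threshold. -/
theorem boundedBy_of_eq_zero {score t : ℝ} (h0 : score = 0) (ht : 0 ≤ t) :
    BoundedBy score t := by
  unfold BoundedBy
  rw [h0]
  exact ht

end Sketch
```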

structure AckermanMalouf2013.LanguageData :

Type

Summary statistics for a language's morphological paradigm system, as reported in published studies.

Fields correspond to Tables 2–3 of [AM13a].

name : String
Language name
family : String
Language family
numClasses : ℕ
Number of inflection classes (E-complexity)
numCells : ℕ
Number of paradigm cells
avgCondEntropy : ℚ
Average conditional entropy H(Cᵢ | Cⱼ) in bits (I-complexity)
maxCellEntropy : ℚ
Maximum cell entropy maxᵢ H(Cᵢ) in bits
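
A record of this shape might be built and queried as in the sketch below. `Sketch.LanguageRecord` is a local stand-in that mirrors the fields listed above, not the library's own declaration, and `toyLanguage` with all of its values is invented for illustration. The last check reflects the information-theoretic fact that conditioning cannot raise entropy, so an average conditional entropy should not exceed the maximum cell entropy.

```lean
import Mathlib

namespace Sketch

/-- Local stand-in mirroring the documented fields. -/
structure LanguageRecord where
  name : String
  family : String
  numClasses : ℕ
  numCells : ℕ
  avgCondEntropy : ℚ
  maxCellEntropy : ℚ

/-- Hypothetical entry; every value here is invented. -/
def toyLanguage : LanguageRecord :=
  { name := "Toy", family := "Constructed", numClasses := 5, numCells := 3,
    avgCondEntropy := 250 / 1000, maxCellEntropy := 2000 / 1000 }

-- Field access reduces by definitional unfolding.
example : toyLanguage.numClasses = 5 := rfl

-- Sanity check: average conditional entropy stays below the maximum cell entropy.
example : toyLanguage.avgCondEntropy < toyLanguage.maxCellEntropy := by
  norm_num [toyLanguage]

end Sketch
```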

def AckermanMalouf2013.instReprLanguageData.repr :

LanguageData → ℕ → Std.Format

Equations

One or more equations did not get rendered due to their size.

@[implicit_reducible]

instance AckermanMalouf2013.instReprLanguageData :

Repr LanguageData

Equations

AckermanMalouf2013.instReprLanguageData = { reprPrec := AckermanMalouf2013.instReprLanguageData.repr }

def AckermanMalouf2013.fur :

LanguageData

Fur (Nilo-Saharan, Fur; Sudan). 4 classes, 2 cells.

Equations

AckermanMalouf2013.fur = { name := "Fur", family := "Nilo-Saharan", numClasses := 4, numCells := 2, avgCondEntropy := 489 / 1000, maxCellEntropy := 1334 / 1000 }

def AckermanMalouf2013.ngiti :

LanguageData

Ngiti (Nilo-Saharan, Central Sudanic; DRC). 8 classes, 2 cells.

Equations

AckermanMalouf2013.ngiti = { name := "Ngiti", family := "Nilo-Saharan", numClasses := 8, numCells := 2, avgCondEntropy := 380 / 1000, maxCellEntropy := 1741 / 1000 }

def AckermanMalouf2013.nuer :

LanguageData

Nuer (Nilo-Saharan, Nilotic; Sudan/South Sudan). 31 classes, 4 cells.

Equations

AckermanMalouf2013.nuer = { name := "Nuer", family := "Nilo-Saharan", numClasses := 31, numCells := 4, avgCondEntropy := 513 / 1000, maxCellEntropy := 3224 / 1000 }

def AckermanMalouf2013.kwerba :

LanguageData

Kwerba (Trans-New Guinea; Papua, Indonesia). 2 classes, 2 cells.

Equations

AckermanMalouf2013.kwerba = { name := "Kwerba", family := "Trans-New Guinea", numClasses := 2, numCells := 2, avgCondEntropy := 469 / 1000, maxCellEntropy := 529 / 1000 }

def AckermanMalouf2013.chinantec :

LanguageData

Chinantec (Oto-Manguean; Oaxaca, Mexico). 62 classes, 4 cells. Comaltepec Chinantec tonal verb paradigms.

Equations

AckermanMalouf2013.chinantec = { name := "Chinantec", family := "Oto-Manguean", numClasses := 62, numCells := 4, avgCondEntropy := 426 / 1000, maxCellEntropy := 4266 / 1000 }

def AckermanMalouf2013.mazatec :

LanguageData

Chiquihuitlan Mazatec (Oto-Manguean; Oaxaca, Mexico). 109 classes, 4 cells. The paper's primary case study (§4).

Equations

AckermanMalouf2013.mazatec = { name := "Chiquihuitlan Mazatec", family := "Oto-Manguean", numClasses := 109, numCells := 4, avgCondEntropy := 709 / 1000, maxCellEntropy := 5248 / 1000 }

def AckermanMalouf2013.finnish :

LanguageData

Finnish (Uralic, Finnic). 51 classes, 8 cells.

Equations

AckermanMalouf2013.finnish = { name := "Finnish", family := "Uralic", numClasses := 51, numCells := 8, avgCondEntropy := 209 / 1000, maxCellEntropy := 3803 / 1000 }

def AckermanMalouf2013.german :

LanguageData

German (Indo-European, Germanic). 7 classes, 8 cells.

Equations

AckermanMalouf2013.german = { name := "German", family := "Indo-European", numClasses := 7, numCells := 8, avgCondEntropy := 45 / 1000, maxCellEntropy := 1906 / 1000 }

def AckermanMalouf2013.russian :

LanguageData

Russian (Indo-European, Slavic). 8 classes, 8 cells.

Equations

AckermanMalouf2013.russian = { name := "Russian", family := "Indo-European", numClasses := 8, numCells := 8, avgCondEntropy := 89 / 1000, maxCellEntropy := 2170 / 1000 }

def AckermanMalouf2013.spanish :

LanguageData

Spanish (Indo-European, Romance). 3 classes, 57 cells.

Equations

AckermanMalouf2013.spanish = { name := "Spanish", family := "Indo-European", numClasses := 3, numCells := 57, avgCondEntropy := 3 / 1000, maxCellEntropy := 1522 / 1000 }

def AckermanMalouf2013.ackermanMalouf2013 :

List LanguageData

All 10 languages in the [AM13a] sample (Table 3).

Equations

One or more equations did not get rendered due to their size.

def AckermanMalouf2013.lcecThreshold :

ℚ

The LCEC threshold: all 10 languages fall below 1 bit of average conditional entropy. Even the most complex system (Mazatec, 109 classes) has I-complexity < 1 bit.

Equations

AckermanMalouf2013.lcecThreshold = 1

def AckermanMalouf2013.mazatecRandomBaseline :

ℚ

Expected I-complexity under random class assignment for Mazatec (Monte Carlo baseline). The paper reports the mean of 1000 random permutations as ~5.25 bits, far above the observed 0.709 bits.

Equations

AckermanMalouf2013.mazatecRandomBaseline = 525 / 100

Each language's reported I-complexity is below the 1-bit threshold. These are "per-datum verification theorems" in linglib's sense: changing a language's avgCondEntropy breaks exactly the corresponding theorem.
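
A per-datum check of this kind could look like the sketch below. The proofs in the library are not shown on this page, so the `norm_num` proof is an assumption about one way to discharge the goal. The names in the `Sketch` namespace are hypothetical, and the value 709 / 1000 is the one listed for Mazatec above.

```lean
import Mathlib

namespace Sketch

/-- Threshold of 1 bit, as in `lcecThreshold`. -/
def threshold : ℚ := 1

/-- Average conditional entropy listed for Mazatec on this page. -/
def mazatecAvg : ℚ := 709 / 1000

/-- The datum and the bound are tied together by one small theorem. -/
theorem mazatecAvg_le : mazatecAvg ≤ threshold := by
  norm_num [mazatecAvg, threshold]

end Sketch
```

Editing `mazatecAvg` to a value above 1 would make `norm_num` fail on this theorem while leaving theorems about other data untouched, which is the point of stating one theorem per datum.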

theorem AckermanMalouf2013.fur_lcec :

fur.avgCondEntropy ≤ lcecThreshold

theorem AckermanMalouf2013.ngiti_lcec :

ngiti.avgCondEntropy ≤ lcecThreshold

theorem AckermanMalouf2013.nuer_lcec :

nuer.avgCondEntropy ≤ lcecThreshold

theorem AckermanMalouf2013.kwerba_lcec :

kwerba.avgCondEntropy ≤ lcecThreshold

theorem AckermanMalouf2013.chinantec_lcec :

chinantec.avgCondEntropy ≤ lcecThreshold

theorem AckermanMalouf2013.mazatec_lcec :

mazatec.avgCondEntropy ≤ lcecThreshold

theorem AckermanMalouf2013.finnish_lcec :

finnish.avgCondEntropy ≤ lcecThreshold

theorem AckermanMalouf2013.german_lcec :

german.avgCondEntropy ≤ lcecThreshold

theorem AckermanMalouf2013.russian_lcec :

russian.avgCondEntropy ≤ lcecThreshold

theorem AckermanMalouf2013.spanish_lcec :

spanish.avgCondEntropy ≤ lcecThreshold

theorem AckermanMalouf2013.all_satisfy_lcec (l : LanguageData) :

l ∈ ackermanMalouf2013 → l.avgCondEntropy ≤ lcecThreshold

All 10 languages satisfy the LCEC.
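
A statement quantified over list membership can be reduced to one case per element. The sketch below does this for a simplified three-element list of rationals (the values listed above for Fur, Mazatec, and Spanish) instead of full `LanguageData` records. The name `Sketch.sample` and the tactic script are assumptions, not the library's proof.

```lean
import Mathlib

namespace Sketch

/-- Three of the average conditional entropies listed on this page. -/
def sample : List ℚ := [489 / 1000, 709 / 1000, 3 / 1000]

/-- Every element of `sample` is at most 1. -/
theorem all_below (x : ℚ) : x ∈ sample → x ≤ 1 := by
  intro hx
  -- Turn membership into a disjunction of equalities, one per element.
  simp [sample] at hx
  -- Substitute each candidate value and check the inequality numerically.
  rcases hx with rfl | rfl | rfl <;> norm_num

end Sketch
```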

The LCEC's key prediction: E-complexity and I-complexity are dissociated. A language can have enormous E-complexity but low I-complexity.

theorem AckermanMalouf2013.mazatec_max_eComplexity (l : LanguageData) :

l ∈ ackermanMalouf2013 → l.numClasses ≤ mazatec.numClasses

Mazatec has maximal E-complexity in the sample (109 classes).

theorem AckermanMalouf2013.mazatec_high_e_low_i :

mazatec.numClasses = 109 ∧ mazatec.avgCondEntropy ≤ 1

Mazatec's I-complexity is still below 1 bit despite 109 classes.

theorem AckermanMalouf2013.eComplexity_doesnt_predict_iComplexity :

kwerba.numClasses < german.numClasses ∧ german.avgCondEntropy < kwerba.avgCondEntropy

Kwerba has the fewest classes in the sample (2), yet its I-complexity is not the lowest: German (7 classes) has lower I-complexity. Together with Mazatec, which has the most classes but still stays below 1 bit, this shows that E-complexity does not predict I-complexity in either direction.

theorem AckermanMalouf2013.spanish_minimal_iComplexity (l : LanguageData) :

l ∈ ackermanMalouf2013 → spanish.avgCondEntropy ≤ l.avgCondEntropy

Spanish has only 3 classes but 57 cells, yet its I-complexity is the lowest in the sample (0.003 bits). Many cells spread over few classes suggests dense implicative structure: most cells are almost fully predictable from any other cell.

The Mazatec case study (§4 of the paper) demonstrates that the observed I-complexity is far below what random assignment of inflection-class patterns would produce.

theorem AckermanMalouf2013.mazatec_well_below_random :

mazatec.avgCondEntropy < mazatecRandomBaseline

Mazatec's observed I-complexity is far below the random baseline. Observed: 0.709 bits. Random permutation baseline: ~5.25 bits. The observed value is less than 14% of the random baseline.

theorem AckermanMalouf2013.mazatec_ratio_to_random :

mazatec.avgCondEntropy * 7 < mazatecRandomBaseline

The ratio of observed to random I-complexity is less than 1/7: 0.709 / 5.25 ≈ 0.135, i.e. about 13.5% of the random baseline.
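
The arithmetic behind both Mazatec comparisons can be checked directly on rationals. This is a standalone sketch using the two values given above (709 / 1000 observed, 525 / 100 baseline); it does not use the library's definitions, and the `norm_num` proofs are an assumption about how such goals may be closed.

```lean
import Mathlib

-- Multiplying through by 7 avoids division: 4.963 < 5.25.
example : (709 / 1000 : ℚ) * 7 < 525 / 100 := by norm_num

-- The same comparison as a ratio: 709 / 5250 ≈ 0.135 is below 14%.
example : (709 / 1000 : ℚ) / (525 / 100) < 14 / 100 := by norm_num
```

Stating the ratio bound as a product, as `mazatec_ratio_to_random` does, keeps the goal free of division and so avoids any side condition about a nonzero denominator.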

theorem AckermanMalouf2013.mazatec_violates_synonymyAvoidance :

mazatec.avgCondEntropy > 0

Mazatec has nonzero I-complexity: it violates [CMC10]'s synonymy avoidance but satisfies the LCEC. This witnesses that the LCEC is strictly weaker than synonymy avoidance.