Linglib.Studies.AckermanMalouf2013

[AM13a]: The Low Conditional Entropy Conjecture

[AM13a] [CMC10]

E-complexity vs. I-complexity

Languages differ dramatically in their enumerative complexity (E-complexity): how many inflection classes, allomorphic variants, and paradigm cells they have. But this apparent complexity is misleading. The key question is integrative complexity (I-complexity): given that a speaker knows some forms of a lexeme, how hard is it to predict the rest?

The LCEC

The Low Conditional Entropy Conjecture states that the average conditional entropy of paradigm cells (how uncertain a speaker is about one cell given another) is uniformly low across typologically diverse languages, regardless of E-complexity. Formally:

I-complexity(L) = (1 / n(n-1)) · Σᵢ≠ⱼ H(Cᵢ | Cⱼ)

is low for all natural languages L, where Cᵢ ranges over paradigm cells and H(Cᵢ | Cⱼ) is the conditional entropy of cell i given cell j.
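To make the formula concrete, here is a minimal sketch in core Lean 4 (no Mathlib) that evaluates it on a toy paradigm table. It is not the library's `Morphology.WP.iComplexity`: the names `entropyBits`, `condEntropy`, `iComplexityToy`, and `toy` are hypothetical, the three classes are invented, every class is assumed to be equally likely, and `Float` arithmetic gives only an approximation.

```lean
namespace IComplexitySketch

/-- Shannon entropy, in bits, of the empirical distribution of the labels in `xs`. -/
def entropyBits (xs : List String) : Float :=
  let total := xs.length.toFloat
  xs.eraseDups.foldl
    (fun (acc : Float) x =>
      let p := (xs.count x).toFloat / total
      acc - p * Float.log2 p)
    0.0

/-- H(Cᵢ | Cⱼ) = H(Cᵢ, Cⱼ) − H(Cⱼ).
    Assumption: each row of `classes` is one inflection class, all equally likely. -/
def condEntropy (classes : List (List String)) (i j : Nat) : Float :=
  -- Joint outcome of cells i and j, encoded as a single label.
  let joint := classes.map (fun c => c.getD i "" ++ "|" ++ c.getD j "")
  let cellJ := classes.map (fun c => c.getD j "")
  entropyBits joint - entropyBits cellJ

/-- Average of H(Cᵢ | Cⱼ) over all ordered pairs i ≠ j of the `n` cells.
    Only meaningful for n ≥ 2 (otherwise the divisor is zero). -/
def iComplexityToy (n : Nat) (classes : List (List String)) : Float :=
  let idx := List.range n
  let total : Float :=
    idx.foldl
      (fun (acc : Float) i =>
        idx.foldl
          (fun (acc' : Float) j =>
            if i == j then acc' else acc' + condEntropy classes i j)
          acc)
      0.0
  total / (n * (n - 1)).toFloat

-- Three invented classes over two cells (exponents are made up for illustration).
def toy : List (List String) := [["-a", "-i"], ["-a", "-e"], ["-o", "-u"]]

#eval iComplexityToy 2 toy

end IComplexitySketch
```

In this toy system the second cell determines the first exactly, so H(C₀ | C₁) = 0, while the first cell leaves one bit of doubt in two of the three classes, so H(C₁ | C₀) = 2/3 bit. The average over the two ordered pairs is therefore roughly 0.33 bits.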

Structure

noncomputable def Morphology.WP.iComplexity {n : ℕ} (ps : ParadigmSystem n String) :

[AM13a]'s integrative complexity: average conditional cell entropy across all off-diagonal cell pairs.

iComplexity(L) = (1 / n(n-1)) · Σᵢ≠ⱼ H(Cᵢ | Cⱼ)

Instantiated at Form := String since A&M's paradigms are over natural-language surface forms.

    def Morphology.WP.LCECHolds {n : ℕ} (ps : ParadigmSystem n String) (threshold : ) :

    The Low Conditional Entropy Conjecture: i-complexity is bounded by a small threshold. The threshold is empirical (typically ≤ 1 nat).


      Summary statistics for a language's morphological paradigm system, as reported in published studies.

      Fields correspond to Tables 2–3 of [AM13a].

      • name : String

        Language name

      • family : String

        Language family

      • numClasses : ℕ

        Number of inflection classes (E-complexity)

      • numCells : ℕ

        Number of paradigm cells

      • avgCondEntropy :

        Average conditional entropy H(Cᵢ | Cⱼ) in bits (I-complexity)

      • maxCellEntropy :

        Maximum cell entropy max H(Cᵢ) in bits


          Fur (Nilo-Saharan, Fur; Sudan). 4 classes, 2 cells.

          Equations
          • AckermanMalouf2013.fur = { name := "Fur", family := "Nilo-Saharan", numClasses := 4, numCells := 2, avgCondEntropy := 489 / 1000, maxCellEntropy := 1334 / 1000 }

            Ngiti (Nilo-Saharan, Central Sudanic; DRC). 8 classes, 2 cells.

            Equations
            • AckermanMalouf2013.ngiti = { name := "Ngiti", family := "Nilo-Saharan", numClasses := 8, numCells := 2, avgCondEntropy := 380 / 1000, maxCellEntropy := 1741 / 1000 }

              Nuer (Nilo-Saharan, Nilotic; Sudan/South Sudan). 31 classes, 4 cells.

              Equations
              • AckermanMalouf2013.nuer = { name := "Nuer", family := "Nilo-Saharan", numClasses := 31, numCells := 4, avgCondEntropy := 513 / 1000, maxCellEntropy := 3224 / 1000 }

                Kwerba (Trans-New Guinea; Papua, Indonesia). 2 classes, 2 cells.

                Equations
                • AckermanMalouf2013.kwerba = { name := "Kwerba", family := "Trans-New Guinea", numClasses := 2, numCells := 2, avgCondEntropy := 469 / 1000, maxCellEntropy := 529 / 1000 }

                  Chinantec (Oto-Manguean; Oaxaca, Mexico). 62 classes, 4 cells. Comaltepec Chinantec tonal verb paradigms.

                  Equations
                  • AckermanMalouf2013.chinantec = { name := "Chinantec", family := "Oto-Manguean", numClasses := 62, numCells := 4, avgCondEntropy := 426 / 1000, maxCellEntropy := 4266 / 1000 }

                    Chiquihuitlan Mazatec (Oto-Manguean; Oaxaca, Mexico). 109 classes, 4 cells. The paper's primary case study (section 4).

                    Equations
                    • AckermanMalouf2013.mazatec = { name := "Chiquihuitlan Mazatec", family := "Oto-Manguean", numClasses := 109, numCells := 4, avgCondEntropy := 709 / 1000, maxCellEntropy := 5248 / 1000 }

                      Finnish (Uralic, Finnic). 51 classes, 8 cells.

                      Equations
                      • AckermanMalouf2013.finnish = { name := "Finnish", family := "Uralic", numClasses := 51, numCells := 8, avgCondEntropy := 209 / 1000, maxCellEntropy := 3803 / 1000 }

                        German (Indo-European, Germanic). 7 classes, 8 cells.

                        Equations
                        • AckermanMalouf2013.german = { name := "German", family := "Indo-European", numClasses := 7, numCells := 8, avgCondEntropy := 45 / 1000, maxCellEntropy := 1906 / 1000 }

                          Russian (Indo-European, Slavic). 8 classes, 8 cells.

                          Equations
                          • AckermanMalouf2013.russian = { name := "Russian", family := "Indo-European", numClasses := 8, numCells := 8, avgCondEntropy := 89 / 1000, maxCellEntropy := 2170 / 1000 }

                            Spanish (Indo-European, Romance). 3 classes, 57 cells.

                            Equations
                            • AckermanMalouf2013.spanish = { name := "Spanish", family := "Indo-European", numClasses := 3, numCells := 57, avgCondEntropy := 3 / 1000, maxCellEntropy := 1522 / 1000 }

                              All 10 languages in the [AM13a] sample (Table 3).


                                The LCEC threshold: all 10 languages fall below 1 bit of average conditional entropy. Even the most complex system (Mazatec, 109 classes) has I-complexity < 1 bit.


                                  Expected I-complexity under random class assignment for Mazatec (Monte Carlo baseline). The paper reports the mean of 1000 random permutations as ~5.25 bits, far above the observed 0.709 bits.


                                    Each language's reported I-complexity is below the 1-bit threshold. These are "per-datum verification theorems" in linglib's sense: changing a language's avgCondEntropy breaks exactly the corresponding theorem.
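The per-datum pattern described above can be sketched in core Lean 4 as follows. This is an illustration under stated assumptions, not the module's actual code: the record `Stats`, the field `avgCondEntropyMilli`, and the constant `thresholdMilli` are hypothetical, and entropies are stored as natural numbers of millibits (709 stands for 0.709 bits) so that `decide` can check each inequality without Mathlib.

```lean
namespace LCECSketch

/-- Hypothetical, simplified record; entropy is stored in millibits. -/
structure Stats where
  name : String
  numClasses : Nat
  avgCondEntropyMilli : Nat

-- Values copied from the entries documented on this page.
def mazatec : Stats :=
  { name := "Chiquihuitlan Mazatec", numClasses := 109, avgCondEntropyMilli := 709 }

def german : Stats :=
  { name := "German", numClasses := 7, avgCondEntropyMilli := 45 }

/-- The 1-bit threshold, expressed in millibits. -/
def thresholdMilli : Nat := 1000

-- One theorem per language: each depends only on that language's record.
theorem mazatec_below_threshold :
    mazatec.avgCondEntropyMilli < thresholdMilli := by decide

theorem german_below_threshold :
    german.avgCondEntropyMilli < thresholdMilli := by decide

end LCECSketch
```

Because each theorem mentions exactly one record, editing `mazatec` so that its entropy reaches 1000 millibits would make only `mazatec_below_threshold` fail, leaving `german_below_threshold` untouched.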

                                    The LCEC's key prediction: E-complexity and I-complexity are dissociated. A language can have enormous E-complexity but low I-complexity.

                                    Mazatec has maximal E-complexity in the sample (109 classes).

                                    Mazatec's I-complexity is still below 1 bit despite 109 classes.

                                    Kwerba has minimal E-complexity (2 classes), but its I-complexity is not the lowest: German (7 classes) has lower I-complexity (0.045 vs. 0.469 bits). E-complexity therefore does not predict I-complexity in either direction.

                                    Spanish has only 3 classes but 57 cells, yet its I-complexity is the lowest in the sample (0.003 bits). With many cells and few classes, almost every cell is predictable from any other, which is what dense implicative structure amounts to.
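The dissociation claims above reduce to a few comparisons on the tabulated values. The following core Lean 4 sketch checks them; the pair encoding (class count, average conditional entropy in millibits) and the namespace are assumptions made for illustration and do not mirror the module's own record type.

```lean
namespace DissociationSketch

-- Each pair is (number of inflection classes, average conditional entropy in millibits).
def kwerba  : Nat × Nat := (2, 469)
def german  : Nat × Nat := (7, 45)
def spanish : Nat × Nat := (3, 3)
def mazatec : Nat × Nat := (109, 709)

-- Fewer classes does not mean lower I-complexity:
-- Kwerba has fewer classes than German but higher average conditional entropy.
example : kwerba.1 < german.1 ∧ german.2 < kwerba.2 := by decide

-- More classes does not push I-complexity past 1 bit:
-- Mazatec has far more classes than Spanish, yet both stay below 1000 millibits.
example : spanish.1 < mazatec.1 ∧ mazatec.2 < 1000 ∧ spanish.2 < 1000 := by decide

end DissociationSketch
```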

                                    The Mazatec case study (§4 of the paper) demonstrates that the observed I-complexity is far below what random assignment of inflection-class patterns would produce.

                                    Mazatec's observed I-complexity is far below the random baseline. Observed: 0.709 bits. Random permutation baseline: ~5.25 bits. The observed value is less than 14% of the random baseline.

                                    The ratio of observed to random I-complexity is less than 1/7. (0.709 / 5.25 ≈ 0.135, i.e., ~13.5% of random)
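The two ratio bounds can be checked without division by cross-multiplying, as in this core Lean 4 sketch. The names `observedMilli` and `baselineMilli` are hypothetical, and both values are scaled to millibits so the arithmetic stays in the natural numbers.

```lean
namespace RatioSketch

-- Mazatec I-complexity, scaled by 1000 (millibits).
def observedMilli : Nat := 709    -- observed: 0.709 bits
def baselineMilli : Nat := 5250   -- random baseline: ~5.25 bits

-- observed / baseline < 1/7  iff  7 * observed < baseline  (baseline > 0)
example : 7 * observedMilli < baselineMilli := by decide

-- observed / baseline < 14/100  iff  100 * observed < 14 * baseline
example : 100 * observedMilli < 14 * baselineMilli := by decide

end RatioSketch
```

The first check is 4963 < 5250 and the second is 70900 < 73500, so both bounds hold with some margin.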

                                    Mazatec has nonzero I-complexity: it violates [CMC10]'s synonymy avoidance but satisfies the LCEC. This witnesses that the LCEC is strictly weaker than synonymy avoidance.