Linglib.Phenomena.Morphology.Studies.AckermanMalouf2013

@cite{ackerman-malouf-2013}: The Low Conditional Entropy Conjecture

@cite{ackerman-malouf-2013} @cite{carstairs-mccarthy-2010}

E-complexity vs. I-complexity

Languages differ dramatically in their enumerative complexity (E-complexity): how many inflection classes, allomorphic variants, and paradigm cells they have. But this apparent complexity is misleading. The key question is integrative complexity (I-complexity): given that a speaker knows some forms of a lexeme, how hard is it to predict the rest?

The LCEC

The Low Conditional Entropy Conjecture states that the average conditional entropy of paradigm cells — how uncertain you are about one cell given another — is uniformly low across typologically diverse languages, regardless of E-complexity. Formally:

I-complexity(L) = (1 / n(n-1)) · Σᵢ≠ⱼ H(Cᵢ | Cⱼ)

is low for all natural languages L, where Cᵢ ranges over paradigm cells and H(Cᵢ | Cⱼ) is the conditional entropy of cell i given cell j.
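The averaging in this definition can be made concrete with a small numerical sketch. The Python below is illustrative only and is not linglib's Lean definition: it assumes a paradigm system is given as one row of exponents per inflection class, that every class carries equal weight, and that entropies are measured in bits. The names `entropy`, `cond_entropy`, and `i_complexity`, and both toy paradigms, are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import log2


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def cond_entropy(classes, i, j):
    """H(C_i | C_j) = H(C_i, C_j) - H(C_j), with classes weighted equally."""
    joint = Counter((row[i], row[j]) for row in classes)
    given = Counter(row[j] for row in classes)
    return entropy(joint) - entropy(given)


def i_complexity(classes):
    """Average of H(C_i | C_j) over all ordered pairs of distinct cells."""
    n = len(classes[0])
    if n < 2:
        raise ValueError("need at least two paradigm cells")
    pairs = list(permutations(range(n), 2))  # the n(n-1) off-diagonal pairs
    return sum(cond_entropy(classes, i, j) for i, j in pairs) / len(pairs)


# Toy system 1: each cell fully determines the other.
implicative = [("a", "x"), ("b", "y")]
# Toy system 2: the two cells vary independently.
independent = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]

print(i_complexity(implicative))  # 0.0 bits
print(i_complexity(independent))  # 1.0 bits
```

The two toy systems mark the extremes for binary cells: knowing one form removes all uncertainty in the first system and none in the second.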

Structure

noncomputable def Morphology.WP.iComplexity {n : } (ps : Core.Morphology.ParadigmSystem n String) :

@cite{ackerman-malouf-2013}'s integrative complexity: average conditional cell entropy across all off-diagonal cell pairs.

iComplexity(L) = (1 / n(n-1)) · Σᵢ≠ⱼ H(Cᵢ | Cⱼ)

Instantiated at Form := String since A&M's paradigms are over natural-language surface forms.

def Morphology.WP.LCECHolds {n : } (ps : Core.Morphology.ParadigmSystem n String) (threshold : ) :

The Low Conditional Entropy Conjecture: i-complexity is bounded by a small threshold. The threshold is empirical (typically ≤ 1 nat).

Summary statistics for a language's morphological paradigm system, as reported in published studies.

Fields correspond to Tables 2–3 of @cite{ackerman-malouf-2013}.

• name : String

  Language name

• family : String

  Language family

• numClasses :

  Number of inflection classes (E-complexity)

• numCells :

  Number of paradigm cells

• avgCondEntropy :

  Average conditional entropy H(Ci|Cj) in bits (I-complexity)

• maxCellEntropy :

  Maximum cell entropy max H(Ci) in bits
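As a rough picture of how such a record might be used, the sketch below mirrors the six fields in a Python dataclass. It is not the Lean structure itself: the class name `LanguageStats`, the snake_case field names, and the helper `below_threshold` are hypothetical. Only figures quoted elsewhere on this page are filled in, and `max_cell_entropy` is left as `None` because this page does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageStats:
    name: str
    family: str
    num_classes: int            # E-complexity
    num_cells: int
    avg_cond_entropy: float     # I-complexity, in bits
    max_cell_entropy: Optional[float] = None  # bits; None = not quoted here


# Values as quoted on this page for two of the ten sample languages.
mazatec = LanguageStats("Chiquihuitlan Mazatec", "Oto-Manguean", 109, 4, 0.709)
spanish = LanguageStats("Spanish", "Indo-European", 3, 57, 0.003)


def below_threshold(stats, threshold_bits=1.0):
    """True if the reported I-complexity is under the given threshold."""
    return stats.avg_cond_entropy < threshold_bits


for lang in (mazatec, spanish):
    print(lang.name, lang.num_classes, below_threshold(lang))
```

Freezing the dataclass keeps each record immutable, which loosely matches the role of the per-language constants documented on this page.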


Fur (Nilo-Saharan, Fur; Sudan). 4 classes, 2 cells.

Ngiti (Nilo-Saharan, Central Sudanic; DRC). 8 classes, 2 cells.

Nuer (Nilo-Saharan, Nilotic; Sudan/South Sudan). 31 classes, 4 cells.

Kwerba (Trans-New Guinea; Papua, Indonesia). 2 classes, 2 cells.

Chinantec (Oto-Manguean; Oaxaca, Mexico). 62 classes, 4 cells. Comaltepec Chinantec tonal verb paradigms.

Chiquihuitlan Mazatec (Oto-Manguean; Oaxaca, Mexico). 109 classes, 4 cells. The paper's primary case study (section 4).

Finnish (Uralic, Finnic). 51 classes, 8 cells.

German (Indo-European, Germanic). 7 classes, 8 cells.

Russian (Indo-European, Slavic). 8 classes, 8 cells.

Spanish (Indo-European, Romance). 3 classes, 57 cells.

All 10 languages in the @cite{ackerman-malouf-2013} sample (Table 3).

The LCEC threshold: all 10 languages fall below 1 bit of average conditional entropy. Even the most complex system (Mazatec, 109 classes) has I-complexity < 1 bit.

Expected I-complexity under random class assignment for Mazatec (Monte Carlo baseline). The paper reports the mean of 1000 random permutations as ~5.25 bits, far above the observed 0.709 bits.
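The idea behind such a baseline can be sketched in Python. This page does not describe the paper's exact randomization procedure, so the version below is only one plausible reading: it shuffles the exponents of each cell independently across classes, recomputes the average conditional entropy, and takes the mean over many trials. Equal class weights, the function names, the default of 200 trials, and the toy paradigm are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import permutations
from math import log2


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def i_complexity(classes):
    """Average H(C_i | C_j) over ordered pairs of distinct cells."""
    n = len(classes[0])
    pairs = list(permutations(range(n), 2))
    total = 0.0
    for i, j in pairs:
        joint = Counter((row[i], row[j]) for row in classes)
        given = Counter(row[j] for row in classes)
        total += entropy(joint) - entropy(given)
    return total / len(pairs)


def random_baseline(classes, trials=200, seed=0):
    """Mean I-complexity after shuffling each cell's column independently."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    columns = [list(col) for col in zip(*classes)]
    total = 0.0
    for _ in range(trials):
        shuffled = [rng.sample(col, len(col)) for col in columns]
        total += i_complexity(list(zip(*shuffled)))
    return total / trials


# Hypothetical toy system: 8 classes, 3 cells, every cell predicts the others.
toy = [(k // 2, k // 2, k // 2) for k in range(8)]

observed = i_complexity(toy)
baseline = random_baseline(toy)
print(observed, baseline)
```

Shuffling within a column preserves each cell's own distribution of exponents, so any gap between the observed and shuffled values reflects the implicative relations between cells rather than the inventory of exponents.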


Each language's reported I-complexity is below the 1-bit threshold. These are "per-datum verification theorems" in linglib's sense: changing a language's avgCondEntropy breaks exactly the corresponding theorem.

The LCEC's key prediction: E-complexity and I-complexity are dissociated. A language can have enormous E-complexity but low I-complexity.

Mazatec has maximal E-complexity in the sample (109 classes).

Mazatec's I-complexity is still below 1 bit despite 109 classes.

Kwerba has minimal E-complexity (2 classes), but its I-complexity is not the lowest: German (7 classes) has lower I-complexity. This shows E-complexity doesn't predict I-complexity in either direction.

Spanish has only 3 classes but 57 cells, yet its I-complexity is the lowest in the sample (0.003 bits). More cells with fewer classes means more implicative structure.

The Mazatec case study (§4 of the paper) demonstrates that the observed I-complexity is far below what random assignment of inflection-class patterns would produce.

Mazatec's observed I-complexity is far below the random baseline. Observed: 0.709 bits. Random permutation baseline: ~5.25 bits. The observed value is less than 14% of the random baseline.

The ratio of observed to random I-complexity is less than 1/7. (0.709 / 5.25 ≈ 0.135, i.e., ~13.5% of random)
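The two bounds just stated can be checked with exact rational arithmetic. The sketch below is a Python illustration, not the Lean proof; it treats the approximate baseline of ~5.25 bits as exactly 5.25, which is an assumption made only so the comparison is exact.

```python
from fractions import Fraction

observed = Fraction(709, 1000)  # 0.709 bits
baseline = Fraction(525, 100)   # ~5.25 bits, treated as exact here

ratio = observed / baseline     # reduces to 709/5250

print(ratio, float(ratio))
print(ratio < Fraction(1, 7))     # True: 709 * 7 = 4963 < 5250
print(ratio < Fraction(14, 100))  # True: 709 * 100 = 70900 < 14 * 5250 = 73500
```

Cross-multiplying as in the comments avoids any floating-point rounding, which is in the same spirit as stating each bound as a separate checkable fact.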

Mazatec has nonzero I-complexity: it violates @cite{carstairs-mccarthy-2010}'s synonymy avoidance but satisfies the LCEC. This witnesses that the LCEC is strictly weaker than synonymy avoidance.