Linglib.Studies.XuEtAl2024

Word Reuse and Combination Support Efficient Communication #

Xu, A., Kemp, C., Frermann, L., & Xu, Y. (2024). Word reuse and combination support efficient communication of emerging concepts. PNAS 121(46), e2406971121.

Empirical contributions #

Using WordNet data from English, French, and Finnish (1900–2000):

- Both reuse items and compounds sit near the Pareto frontier of communicative efficiency (Fig. 2).
- Attested encodings are more efficient than random and near-synonym baselines (Fig. 3).
- Literal items (hyponymic reuse, endocentric compounds) tend to be more efficient than their nonliteral counterparts (paper §3.2; significant for French and Finnish reuse, with English reuse supplemented by compound head words because WordNet does not directly classify English-reuse literality).
- Reuse items tend to be shorter than compounds across all three languages; compounds tend to be more informative than reuse items in English and French only (paper §3.3; Finnish does not show the informativeness asymmetry).

Connection to polysemy #

Word reuse is a polysemy-generating process: when mouse acquires the sense "computer peripheral", the word becomes polysemous. This study provides an information-theoretic account of why productive polysemy exists: it is communicatively efficient under a tradeoff between length and listener confusion. It thereby bridges synchronic copredication judgments ([Ash11], [Got17]) to a diachronic functional account.

§0. Lexicalization substrate #

The information-theoretic model of lexicalization: novel concepts enter the lexicon either by reuse (an existing word picks up a new sense — mouse → computer peripheral) or by compounding (concatenation of existing words — spreadsheet). Both strategies are shaped by the same tradeoff between speaker effort (word length) and information loss (listener confusion).

This is a model of innovation spread under variation, not a synchronic optimization of a static lexicon. [XKFX24] §1–§2 ground the model in the variation-theory tradition ([weinreich-labov-herzog-1968], [milroy-milroy-1985], [labov-2011]): there is a spread interval [t₁, t₂] during which only some members of the speech community have acquired the new encoding E*. The speaker's production policy is conditioned on the expanded lexicon L' (= L ∪ E*); the listener's interpretation is conditioned on the existing lexicon L. This asymmetry is the diachronic content; it lives at the type level via Pragmatics.Communication.AsymmetricCommModel, whose produce and comprehend are independent functions that may disagree.
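The speaker/listener asymmetry can be pictured with a small stand-alone sketch. The structure below is a hypothetical stand-in, not the definition of Pragmatics.Communication.AsymmetricCommModel: the field types, the argument order, the use of Float instead of ℝ, and the scores 0.9 / 0.1 are all assumptions made for illustration.

```lean
namespace AsymmetrySketch

/-- Hypothetical stand-in: two independent scoring functions. -/
structure AsymModel (Meaning Form : Type) where
  /-- Speaker side (conditioned on L'): score of producing `f` for `m`. -/
  produce : Meaning → Form → Float
  /-- Listener side (conditioned on L): score of recovering `m` from `f`. -/
  comprehend : Form → Meaning → Float

/-- A toy speaker who already uses `mouse` for the peripheral, paired with
    a toy listener who still mostly expects the rodent sense. -/
def innovating : AsymModel String String where
  produce := fun m f =>
    if m == "computer peripheral" && f == "mouse" then 1.0 else 0.0
  comprehend := fun f m =>
    if f == "mouse" then
      if m == "rodent" then 0.9
      else if m == "computer peripheral" then 0.1
      else 0.0
    else 0.0

-- The two sides disagree about the same form–meaning pair.
#eval innovating.produce "computer peripheral" "mouse"
#eval innovating.comprehend "mouse" "computer peripheral"

end AsymmetrySketch
```

Keeping the two functions as separate fields is what lets a single value represent a community in the middle of the spread interval, where production and interpretation are not yet aligned.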

inductive XuEtAl2024.Polysemy.Strategy :

Strategy by which a novel concept enters the lexicon ([XKFX24] Table 1: reuse items R vs. compounds C).

The paper notes that borrowing (e.g., tofu) and coinage (e.g., quark) are additional lexicalization strategies excluded from its scope; this enum mirrors that scope restriction.

reuse : Strategy
Reuse an existing word for a new meaning. E.g., mouse (rodent → peripheral), dish (plate → antenna).
compound : Strategy
Concatenate existing words into a compound. E.g., birthday card, spreadsheet, urban renewal.

Instances For

@[implicit_reducible]

instance XuEtAl2024.Polysemy.instDecidableEqStrategy :

DecidableEq Strategy

Equations

XuEtAl2024.Polysemy.instDecidableEqStrategy x✝ y✝ = if h : x✝.ctorIdx = y✝.ctorIdx then isTrue ⋯ else isFalse ⋯

@[implicit_reducible]

instance XuEtAl2024.Polysemy.instReprStrategy :

Repr Strategy

Equations

XuEtAl2024.Polysemy.instReprStrategy = { reprPrec := XuEtAl2024.Polysemy.instReprStrategy.repr }

def XuEtAl2024.Polysemy.instReprStrategy.repr :

Strategy → ℕ → Std.Format

Equations

One or more equations did not get rendered due to their size.

Instances For

inductive XuEtAl2024.Polysemy.Literality :

Literality of the form–meaning relationship. Literal items tend to be more communicatively efficient ([XKFX24] §3.2).

This binary distinction is a coarsening of the continuous taxonomic-distance measures the paper also tests (Wu-Palmer 1994; Leacock-Chodorow-Miller 1998), reported in paper §3.2 (final paragraph) and SI §S5.E as monotonically correlated with efficiency loss. The enum captures the headline literal/non-literal contrast; a continuous version would parameterize over a distance metric.

literal : Literality
Form directly relates to the intended concept.
- Reuse: intended sense is a hyponym of an existing sense.
- Compound: endocentric (head = superordinate of intended concept).
nonliteral : Literality
Metaphorical or metonymic relationship.
- Reuse: e.g., mouse for computer peripheral.
- Compound: exocentric, e.g., boîte noire = flight recorder.

Instances For

@[implicit_reducible]

instance XuEtAl2024.Polysemy.instDecidableEqLiterality :

DecidableEq Literality

Equations

XuEtAl2024.Polysemy.instDecidableEqLiterality x✝ y✝ = if h : x✝.ctorIdx = y✝.ctorIdx then isTrue ⋯ else isFalse ⋯

def XuEtAl2024.Polysemy.instReprLiterality.repr :

Literality → ℕ → Std.Format

Equations

One or more equations did not get rendered due to their size.

Instances For

@[implicit_reducible]

instance XuEtAl2024.Polysemy.instReprLiterality :

Repr Literality

Equations

XuEtAl2024.Polysemy.instReprLiterality = { reprPrec := XuEtAl2024.Polysemy.instReprLiterality.repr }

structure XuEtAl2024.Polysemy.FormConceptPair :

A form–concept pair in an emerging encoding (one entry in E*).

The concept field is a human-readable label. In the paper's actual pipeline ([XKFX24] §5.3), concepts are WordNet sense IDs embedded via Sentence-BERT; two distinct senses can share a surface label, so a serious instantiation needs disambiguating IDs. The string here is a presentation-layer convenience for example data.

form : String
concept : String
strategy : Strategy
literality : Literality
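For orientation, the three declarations above could be written roughly as follows. This is a reconstruction from the rendered signatures and instances, not the module's source; the `Sketch` namespace and the `mouse` entry (taken from the examples in the docstrings, not necessarily an element of the Table 1 lists) are illustrative assumptions.

```lean
namespace Sketch

inductive Strategy where
  | reuse      -- existing word picks up a new sense
  | compound   -- existing words are concatenated
  deriving DecidableEq, Repr

inductive Literality where
  | literal
  | nonliteral
  deriving DecidableEq, Repr

structure FormConceptPair where
  form : String
  concept : String
  strategy : Strategy
  literality : Literality
  deriving Repr

/-- Illustrative entry: `mouse` reused nonliterally for the peripheral. -/
def mouse : FormConceptPair :=
  { form := "mouse", concept := "computer peripheral",
    strategy := .reuse, literality := .nonliteral }

#eval mouse               -- prints the record via the derived `Repr`
#eval mouse.form.length   -- 5, the orthographic length used as effort

end Sketch
```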

Instances For

@[implicit_reducible]

instance XuEtAl2024.Polysemy.instReprFormConceptPair :

Repr FormConceptPair

Equations

XuEtAl2024.Polysemy.instReprFormConceptPair = { reprPrec := XuEtAl2024.Polysemy.instReprFormConceptPair.repr }

def XuEtAl2024.Polysemy.instReprFormConceptPair.repr :

FormConceptPair → ℕ → Std.Format

Equations

One or more equations did not get rendered due to their size.

Instances For

def XuEtAl2024.Polysemy.FormConceptPair.formLength (p : FormConceptPair) :

ℕ

Orthographic form length, used as the speaker-effort proxy in [XKFX24] (paper eq. 2).

Equations

p.formLength = p.form.length

Instances For

def XuEtAl2024.Polysemy.surprisalCap :

ℝ

Per-pair surprisal cap used when model.comprehend returns a non-positive value. The paper's softmax model never produces zero listener probability, so this bound is for numeric robustness only. Default is 20 nats ≈ 28.9 bits, comfortably above the typical attested information loss of ~10–15 bits ([XKFX24] Fig. 2 axes; paper uses log₂, this file uses natural log).
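The role of the cap and the nats-to-bits conversion can be checked numerically with a small Float sketch. The names `capNats`, `cappedSurprisal`, and `natsToBits` are hypothetical and do not appear in the module; the real definition lives over ℝ and may apply the cap differently.

```lean
/-- Hypothetical Float analogue of `surprisalCap`, in nats. -/
def capNats : Float := 20.0

/-- Surprisal in nats, falling back to the cap for non-positive scores,
    where the logarithm would be undefined or infinite. -/
def cappedSurprisal (p : Float) : Float :=
  if p ≤ 0.0 then capNats else -Float.log p

/-- Convert nats to bits by dividing by ln 2. -/
def natsToBits (x : Float) : Float := x / Float.log 2.0

#eval natsToBits capNats     -- roughly 28.85 bits
#eval cappedSurprisal 0.0    -- the cap, 20 nats
#eval cappedSurprisal 0.5    -- ln 2, roughly 0.693 nats
```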

Equations

XuEtAl2024.Polysemy.surprisalCap = 20

Instances For

noncomputable def XuEtAl2024.Polysemy.encodingCosts (pairs : List FormConceptPair) (needProb : String → ℝ) (model : Pragmatics.Communication.AsymmetricCommModel String String) :

Pragmatics.Efficiency.CostPair

Communicative costs of an encoding under an asymmetric communication model. Cost₂ uses model.comprehend (the listener-side channel conditioned on the existing lexicon L), reflecting the diachronic asymmetry [XKFX24] introduces. The speaker-side produce channel is not consumed in the deterministic-policy case (see below) but lives in the same model so future non-deterministic versions can read it.

cost₁ (paper eq. 2): expected word length under needProb. cost₂ (paper eq. 3): expected surprisal under the listener distribution. The β-weighted sum cost₂ + β · cost₁ recovers L_β (paper eq. 4); the per-pair, proportional rearrangement is paper eq. 5. We use natural log throughout, so cost₂ is in nats; multiply by 1 / Real.log 2 for the paper's bits convention.

Deterministic-policy assumption. The signature takes needProb : String → ℝ (a concept-only marginal) rather than a joint p(c, w | L'). This silently assumes a deterministic production policy — one form per concept in pairs. Paper §5.2 estimates p(c, w | L') = p(w | L') · p(c | w, L') separately for each language; under the assumption that each emerging form-sense pair appears with multiplicity 1 in E*, the joint reduces to needProb p.concept and these costs are exact. With non-deterministic encodings (multiple forms competing for one concept), use the joint instead and marginalize.
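Since the equations for encodingCosts are not rendered, here is a minimal Float sketch of the two costs as the docstring describes them. It is not the module's definition: `Pair`, `effort`, `infoLoss`, the argument order of `comprehend`, and the toy concept labels are assumptions, and the real function returns a Pragmatics.Efficiency.CostPair over ℝ.

```lean
namespace CostSketch

structure Pair where
  form : String
  concept : String

/-- Surprisal in nats, capped at 20 for non-positive listener scores. -/
def surprisal (p : Float) : Float :=
  if p ≤ 0.0 then 20.0 else -Float.log p

def sumF (xs : List Float) : Float := xs.foldl (· + ·) 0.0

/-- cost₁ analogue: need-weighted word length (speaker effort). -/
def effort (pairs : List Pair) (needProb : String → Float) : Float :=
  sumF (pairs.map fun p => needProb p.concept * p.form.length.toFloat)

/-- cost₂ analogue: need-weighted listener surprisal (information loss). -/
def infoLoss (pairs : List Pair) (needProb : String → Float)
    (comprehend : String → String → Float) : Float :=
  sumF (pairs.map fun p =>
    needProb p.concept * surprisal (comprehend p.form p.concept))

def toy : List Pair :=
  [⟨"mouse", "computer peripheral"⟩, ⟨"spreadsheet", "tabular worksheet"⟩]

#eval effort toy (fun _ => 0.5)                      -- 0.5·5 + 0.5·11 = 8
#eval infoLoss toy (fun _ => 0.5) (fun _ _ => 0.5)   -- ln 2, roughly 0.693

end CostSketch
```

Because each concept appears once in `toy`, indexing the need distribution by concept alone is enough, which is the deterministic-policy case described above.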

Equations

One or more equations did not get rendered due to their size.

Instances For

noncomputable def XuEtAl2024.Polysemy.unifiedObjective (pairs : List FormConceptPair) (needProb : String → ℝ) (model : Pragmatics.Communication.AsymmetricCommModel String String) (β : ℝ) :

ℝ

Combined cost (paper eq. 4): L_β = info_loss + β · effort. The β-scalarization of an encoding's CostPair; parameterizes the Pareto frontier in Pragmatics.Efficiency.
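As a hedged illustration of the scalarization, the sketch below combines a cost pair with different values of β. `CostPairSketch` and `weightedCostSketch` are hypothetical names with assumed field names; the actual Pragmatics.Efficiency.weightedCost is not shown on this page and may differ in detail.

```lean
/-- Hypothetical stand-in for a cost pair (effort, information loss). -/
structure CostPairSketch where
  effort : Float
  infoLoss : Float

/-- L_β = info_loss + β · effort, following the formula in the docstring. -/
def weightedCostSketch (c : CostPairSketch) (β : Float) : Float :=
  c.infoLoss + β * c.effort

-- β = 0 ignores effort entirely; larger β penalizes long forms more.
#eval weightedCostSketch { effort := 8.0, infoLoss := 0.693 } 0.0
#eval weightedCostSketch { effort := 8.0, infoLoss := 0.693 } 0.1
```

Sweeping β from 0 upward traces out which encodings are optimal at each exchange rate between length and listener confusion, which is how a single scalar objective can parameterize a Pareto frontier.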

Equations

XuEtAl2024.Polysemy.unifiedObjective pairs needProb model β = Pragmatics.Efficiency.weightedCost (XuEtAl2024.Polysemy.encodingCosts pairs needProb model) β

Instances For

§1. Example data (Table 1) #

def XuEtAl2024.Polysemy.englishReuse :

List FormConceptPair

English reuse items from paper Table 1.

Equations

One or more equations did not get rendered due to their size.

Instances For

def XuEtAl2024.Polysemy.englishCompounds :

List FormConceptPair

English compounds from paper Table 1.

Equations

One or more equations did not get rendered due to their size.

Instances For

def XuEtAl2024.Polysemy.frenchReuse :

List FormConceptPair

French reuse items.

Equations

One or more equations did not get rendered due to their size.

Instances For

def XuEtAl2024.Polysemy.frenchCompounds :

List FormConceptPair

French compounds.

Equations

One or more equations did not get rendered due to their size.

Instances For

§2. Strategy properties (verified on example data) #

The full Pareto-efficiency claims (Figs. 2–3) depend on a fitted sentence-encoder embedding for the listener's prototype distribution (paper §5.3) and 100,000 random/near-synonym baseline encodings per language–interval cell (paper §5.5); they are not reduced to decide-checkable form here. The claims that ARE decide-checkable on the Table-1 examples are about word length — the speaker-effort axis.

theorem XuEtAl2024.Polysemy.english_reuse_shorter :

(List.map (fun (x : FormConceptPair) => x.formLength) englishReuse).sum / englishReuse.length < (List.map (fun (x : FormConceptPair) => x.formLength) englishCompounds).sum / englishCompounds.length

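Reuse items are shorter on average than compounds (paper §3.3: holds across all three languages and all time intervals). As stated, the averages use ℕ (truncating) division, so the theorem compares floored means on the Table 1 examples.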

theorem XuEtAl2024.Polysemy.french_reuse_shorter :

(List.map (fun (x : FormConceptPair) => x.formLength) frenchReuse).sum / frenchReuse.length < (List.map (fun (x : FormConceptPair) => x.formLength) frenchCompounds).sum / frenchCompounds.length

French reuse items are also shorter on average than French compounds.
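The shape of these length comparisons can be reproduced on a handful of forms. The lists below reuse example forms quoted in the Strategy docstrings; they are not claimed to be the contents of englishReuse or englishCompounds, and `meanLen` is a hypothetical helper. The sketch evaluates the comparison with `#eval` rather than proving it.

```lean
def reuseForms : List String := ["mouse", "dish"]
def compoundForms : List String :=
  ["birthday card", "spreadsheet", "urban renewal"]

/-- Floored mean form length; ℕ division truncates (and yields 0 on `[]`). -/
def meanLen (forms : List String) : Nat :=
  (forms.map String.length).foldl (· + ·) 0 / forms.length

#eval meanLen reuseForms      -- (5 + 4) / 2 = 4
#eval meanLen compoundForms   -- (13 + 11 + 13) / 3 = 12
#eval decide (meanLen reuseForms < meanLen compoundForms)   -- true
```

Because everything is computed in ℕ, the comparison is decidable by evaluation, at the price of comparing floored rather than exact means.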

theorem XuEtAl2024.Polysemy.both_literalities :

(englishReuse.any fun (x : FormConceptPair) => x.literality == Literality.literal) = true ∧ (englishReuse.any fun (x : FormConceptPair) => x.literality == Literality.nonliteral) = true ∧ (englishCompounds.any fun (x : FormConceptPair) => x.literality == Literality.literal) = true ∧ (englishCompounds.any fun (x : FormConceptPair) => x.literality == Literality.nonliteral) = true

Both strategies include literal and nonliteral items in the paper's Table 1 sample.

§3. Substrate witnesses #

Concrete instantiations of encodingCosts and unifiedObjective demonstrate the Theory-layer substrate is operationally consumed. The toy needProb and model below are not the paper's actual fitted distributions; they are stipulated only to anchor the type-checking. Real instantiation requires the WordNet+Sentence-BERT pipeline of paper §5.3 + §5.4.

noncomputable def XuEtAl2024.Polysemy.uniformNeed (n : ℕ) :

String → ℝ

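A toy uniform-need distribution: 1/n, intended as the need probability of each concept in a list-derived encoding of size n. As defined, it is a constant function that returns 1/n for every input, including strings outside the encoding; serious use would derive need probabilities from corpus frequencies.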

Equations

XuEtAl2024.Polysemy.uniformNeed n x✝ = 1 / ↑n

Instances For

noncomputable def XuEtAl2024.Polysemy.stipulatedModel :

Pragmatics.Communication.AsymmetricCommModel String String

A toy symmetric communication model with constant listener score 1/2. Makes encodingCosts.cost₂ a determinate value for the witness theorems below.
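Under two assumptions that this page does not confirm (that cost₂ is the need-weighted sum of −ln of the comprehend score, and that uniformNeed is instantiated with n equal to the number of pairs), the toy model gives the same information loss for any encoding:

```latex
\mathrm{cost}_2
  = \sum_{i=1}^{n} \frac{1}{n}\,\Bigl(-\ln \tfrac{1}{2}\Bigr)
  = \ln 2
  \approx 0.693\ \text{nats}
  = 1\ \text{bit}
```

So under this toy model the reuse and compound encodings would differ only on the effort axis, consistent with its role as a type-checking anchor rather than a fitted listener.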

Equations

XuEtAl2024.Polysemy.stipulatedModel = Pragmatics.Communication.AsymmetricCommModel.symmetric fun (x x_1 : String) => 1 / 2

Instances For

noncomputable def XuEtAl2024.Polysemy.englishReuseCosts :

Pragmatics.Efficiency.CostPair

The English-reuse encoding's costs under the toy model.

Equations

One or more equations did not get rendered due to their size.

Instances For

noncomputable def XuEtAl2024.Polysemy.englishCompoundsCosts :

Pragmatics.Efficiency.CostPair

The English-compound encoding's costs under the toy model.

Equations

One or more equations did not get rendered due to their size.

Instances For

theorem XuEtAl2024.Polysemy.unifiedObjective_eq_weightedCost (pairs : List FormConceptPair) (np : String → ℝ) (m : Pragmatics.Communication.AsymmetricCommModel String String) (β : ℝ) :

unifiedObjective pairs np m β = Pragmatics.Efficiency.weightedCost (encodingCosts pairs np m) β

unifiedObjective decomposes into weightedCost (encodingCosts ...) β. This rfl theorem witnesses that the named unifiedObjective hook is the β-scalarization of the cost pair the substrate computes — no extra arithmetic, no glue.

§4. Reuse as polysemy generation #

def XuEtAl2024.Polysemy.reuseIsPolysemyGeneration :

List FormConceptPair → List String

Word reuse creates polysemy: the reused word acquires a new sense alongside its existing one. Connects the diachronic process of lexicalization to the synchronic phenomenon of polysemy.

Copredication ([Ash11], [Got17]) is the synchronic consequence of reuse (multiple aspects coexist); this paper's account explains the diachronic cause (efficiency pressure).

Caveat on the copredication bridge. The reuse polysemy of Xu et al. and logical polysemy are not the same phenomenon. Logical polysemy involves sortally-compatible aspects with a shared individuation ground (book = phys × info, both individuating one volume); the mouse → peripheral reuse generates two unrelated sortal categories with no shared ground. The honest bridge: literal reuse (hyponymic, e.g. car narrowed from wheeled cart) is compatible with logical polysemy (shared ground); non-literal reuse (metaphorical, e.g. mouse) is not. The Literality enum in the Theory file is the partition this distinction lives on.

Equations

One or more equations did not get rendered due to their size.

Instances For

theorem XuEtAl2024.Polysemy.all_english_reuse_creates_polysemy :

(reuseIsPolysemyGeneration englishReuse).length = englishReuse.length

All reuse items in the English data generate polysemy.
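The equations for reuseIsPolysemyGeneration are not rendered, so the following is only one plausible reading of its type List FormConceptPair → List String: keep the reuse items and return their forms. `reusedForms`, the trimmed `Pair` structure, and the sample list are hypothetical.

```lean
namespace ReuseSketch

inductive Strategy where
  | reuse | compound
  deriving DecidableEq, Repr

structure Pair where
  form : String
  strategy : Strategy

/-- Forms that become polysemous: those introduced by reuse. -/
def reusedForms (pairs : List Pair) : List String :=
  (pairs.filter fun p => p.strategy == Strategy.reuse).map fun p => p.form

def sample : List Pair :=
  [⟨"mouse", Strategy.reuse⟩, ⟨"dish", Strategy.reuse⟩]

#eval reusedForms sample                             -- ["mouse", "dish"]
#eval (reusedForms sample).length == sample.length   -- true

end ReuseSketch
```

On a list made up only of reuse items the filter drops nothing, so the output has the same length as the input; that is the shape of the length equality stated in the theorem above.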