Word Reuse and Combination Support Efficient Communication #
@cite{xu-etal-2024}
Xu, A., Kemp, C., Frermann, L., & Xu, Y. (2024). Word reuse and combination support efficient communication of emerging concepts. PNAS 121(46), e2406971121.
Empirical contributions #
Using WordNet data from English, French, and Finnish (1900–2000):
- Both reuse items and compounds sit near the Pareto frontier of communicative efficiency (Fig. 2).
- Attested encodings are more efficient than random and near-synonym baselines (Fig. 3).
- Literal items (hyponymic reuse, endocentric compounds) tend to be more efficient than their nonliteral counterparts (paper §3.2; significant for French and Finnish reuse; English reuse is supplemented with compound head words because WordNet does not directly classify the literality of English reuse).
- Reuse items tend to be shorter than compounds in all three languages; compounds tend to be more informative than reuse items in English and French only (paper §3.3; Finnish does not show the informativeness asymmetry).
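The Pareto-frontier claims of Fig. 2 rest on a dominance order over (speaker-effort, listener-confusion) cost pairs. A minimal Lean sketch of that order, assuming Mathlib's ℚ; the names `dominates` and `paretoOptimal` are illustrative, not from the formalization, and nothing here reproduces the paper's fitted pipeline:

```lean
import Mathlib

-- Illustrative only: Pareto dominance on (length, confusion) cost pairs.
-- An encoding sits on the frontier when no competitor dominates it.
def dominates (a b : ℚ × ℚ) : Prop :=
  a.1 ≤ b.1 ∧ a.2 ≤ b.2 ∧ (a.1 < b.1 ∨ a.2 < b.2)

/-- `p` is Pareto-optimal within the candidate set `s`. -/
def paretoOptimal (s : List (ℚ × ℚ)) (p : ℚ × ℚ) : Prop :=
  ∀ q ∈ s, ¬ dominates q p
```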
Connection to polysemy #
Word reuse is a polysemy-generating process: when mouse acquires the
sense "computer peripheral", the word becomes polysemous. This study
provides an information-theoretic account of why productive polysemy
exists — it is communicatively efficient under a tradeoff between
length and listener confusion. This bridges Phenomena.Polysemy.Studies.Gotham2017
(synchronic copredication judgments) to a diachronic functional account.
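Schematically, the tradeoff is the β-scalarization that §3 below computes over a cost pair (cost₁ = speaker effort, here expected word length; cost₂ = listener confusion). This is the shape of the objective, not the paper's fitted quantities, which come from the §5.3–§5.4 pipeline:

$$ J_\beta(e) = \beta \cdot \mathrm{cost}_1(e) + (1 - \beta) \cdot \mathrm{cost}_2(e) $$

Small β prioritizes listener accuracy; large β prioritizes short words.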
§1. Example data (Table 1) #
English reuse items from paper Table 1.
English compounds from paper Table 1.
French reuse items.
French compounds.
§2. Strategy properties (verified on example data) #
The full Pareto-efficiency claims (Figs. 2–3) depend on a fitted
sentence-encoder embedding for the listener's prototype distribution
(paper §5.3) and 100,000 random/near-synonym baseline encodings per
language–interval cell (paper §5.5); they are not reduced to
decide-checkable form here. The claims that are decide-checkable on
the Table-1 examples concern word length, the speaker-effort axis.
Reuse items are shorter on average than compounds (paper §3.3: holds across all three languages and all time intervals).
French reuse items are also shorter on average than French compounds.
Both strategies include literal and nonliteral items in the paper's Table 1 sample.
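The length claim can be anchored by a toy decide check. A sketch with stipulated character counts standing in for Table-1 items (`reuseLens` and `compoundLens` are illustrative names, not the formalization's; `List.sum` as in Mathlib):

```lean
import Mathlib

-- Toy stand-ins: character counts of a few reuse items vs. compounds.
def reuseLens : List Nat := [5, 3, 5]    -- e.g. "mouse", "web", "cloud"
def compoundLens : List Nat := [9, 10]   -- e.g. "lawnmower", "lighthouse"

-- Mean-length comparison, cross-multiplied to stay in ℕ:
-- 13 * 2 < 19 * 3.
example :
    reuseLens.sum * compoundLens.length
      < compoundLens.sum * reuseLens.length := by decide
```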
§3. Substrate witnesses #
Concrete instantiations of encodingCosts and unifiedObjective
demonstrate that the Theory-layer substrate is operationally consumed.
The toy needProb and model below are not the paper's actual
fitted distributions; they are stipulated only to anchor the
type-checking. Real instantiation requires the WordNet+Sentence-BERT
pipeline of paper §5.3 + §5.4.
A toy uniform-need distribution: 1/n for each concept in a
list-derived encoding, 0 elsewhere. It is a constant function for the
witness; serious use would derive it from corpus frequencies.
Equations
- Phenomena.Polysemy.uniformNeed n x✝ = 1 / ↑n
A toy symmetric communication model with constant listener score 1/2.
Makes encodingCosts.cost₂ a determinate value for the witness
theorems below.
Equations
- Phenomena.Polysemy.stipulatedModel = Pragmatics.Communication.AsymmetricCommModel.symmetric fun (x x_1 : String) => 1 / 2
The English-reuse encoding's costs under the toy model.
The English-compound encoding's costs under the toy model.
unifiedObjective decomposes into weightedCost (encodingCosts ...) β.
This rfl theorem witnesses that the named unifiedObjective hook
is the β-scalarization of the cost pair the substrate computes —
no extra arithmetic, no glue.
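The witness has roughly this shape. A sketch assuming the Theory-file signatures `unifiedObjective`, `weightedCost`, `encodingCosts`, and an `Encoding` type; the actual statement lives in the formalization:

```lean
-- Sketch: the scalarization identity holds definitionally, so `rfl` closes it.
theorem unifiedObjective_decomposes (e : Encoding) (β : ℚ) :
    unifiedObjective e β = weightedCost (encodingCosts e) β := rfl
```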
§4. Reuse as polysemy generation #
Word reuse creates polysemy: the reused word acquires a new sense alongside its existing one. This connects the diachronic process of lexicalization to the synchronic phenomenon of polysemy.
The copredication data in Phenomena.Polysemy.Studies.Gotham2017
captures the synchronic consequence of reuse (multiple aspects
coexist); this paper's account explains the diachronic cause
(efficiency pressure).
Caveat on the Gotham bridge. Xu's reuse polysemy and Gotham's
logical polysemy are not the same phenomenon. Gotham's DotType
requires sortally-compatible aspects with a shared individuation
ground (book = phys × info, both individuating one volume); Xu's
mouse → peripheral generates two unrelated sortal categories
with no shared ground. The honest bridge: Xu's literal reuse
(hyponymic, e.g. car narrowed from wheeled cart) is
Gotham-compatible (shared ground); Xu's non-literal reuse
(metaphorical, e.g. mouse) is not. The Literality enum
in the Theory file is the partition this distinction lives on.
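The partition can be pictured as follows (a sketch mirroring the shape described above; the real `Literality` enum lives in the Theory file and may differ in constructor names):

```lean
inductive Literality
  | literal     -- hyponymic reuse / endocentric compound: Gotham-compatible
  | nonliteral  -- metaphorical reuse (mouse → peripheral): no shared ground
```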
All reuse items in the English data generate polysemy.