Linglib.Studies.AnandHardtMcCloskey2021

[AHMcC21] — The Santa Cruz sluicing data set #

Distributional findings from the Santa Cruz sluicing data set, a 4,700-example annotated corpus of naturally occurring English sluices drawn from the New York Times subset of the Gigaword corpus, built by the Santa Cruz Ellipsis Project (SCEP).

Key Findings #

Sprouting dominates: 65.5% of sluices have no overt correlate, overturning the theoretical literature's focus on merger (34.5%).
why / Reason is the largest remnant type, but not a majority: why sluices make up 53.8% of sprouting cases and 37.2% of all sluices, and the Reason semantic type (Table 1) totals 1,752 of 4,700 (≈37.3%). That makes Reason the single most frequent remnant type while leaving it well short of half of all sluices. (The "53.8% of the total" figure holds only in the paper's footnote variant that folds in why not questions, which the authors caution "should probably not be analyzed in the same terms as sluicing.")
Mismatches are systematic but bounded: tense (129), modality (394), polarity (28), and new-words (71 clear cases) mismatches between antecedent and ellipsis site are attested. Voice and argument-structure (transitivity) mismatches have zero attestations — which the paper is careful to note is consistent with great rarity, not demonstrated impossibility.
Embedding is the norm: 72.4% of sluices are embedded (3,404), vs. 27.6% root (1,296).
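
The embedded/root and Reason percentages above can be recomputed from the raw counts as a sanity check. The following is a minimal Lean 4 sketch and is not part of the Linglib source: `pctTenths` is a hypothetical helper that rounds a share to the nearest tenth of a percent, and discharging each check with `decide` is an assumption about how one might prove it.

```lean
/-- Hypothetical helper (not from Linglib): the share `n / total`,
    expressed in tenths of a percent and rounded to nearest. -/
def pctTenths (n total : Nat) : Nat :=
  (1000 * n + total / 2) / total

-- Embedded and root counts together cover all 4,700 sluices.
example : 3404 + 1296 = 4700 := by decide

-- 3,404 / 4,700 rounds to 72.4%; 1,296 / 4,700 rounds to 27.6%.
example : pctTenths 3404 4700 = 724 := by decide
example : pctTenths 1296 4700 = 276 := by decide

-- The Reason total, 1,752 / 4,700, rounds to 37.3%.
example : pctTenths 1752 4700 = 373 := by decide
```

Only the figures for which the summary gives raw counts are checked here; the sprouting and merger percentages are reported without counts, so they are left out.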

The paper is a descriptive data-set report; it explicitly defers theoretical interpretation to a companion paper. The Syntactic-Isomorphism-Condition bridge theorems that relate this corpus to a formal-matching account therefore live in Studies/AnandHardtMcCloskey2025.lean rather than here, so that each result stays attached to the paper that establishes it (chronological grounding).

Antecedent–ellipsis mismatch dimensions (§5) #

The data set documents which form/interpretation mismatches between antecedent and ellipsis site are attested and which are absent. Attested mismatches challenge strict syntactic-identity requirements; the absent ones (especially argument structure) constrain theories of ellipsis licensing. These six dimensions are the §5 mismatch taxonomy — distinct from the paper's broader annotation ontology (antecedent/wh-remnant/paraphrase/correlate obligatory tags; the E-TYPE, IGNORE, ISLAND, MISSINGANTE, QEMBEDDER global tags), which is not modelled here.

inductive AnandHardtMcCloskey2021.MismatchDimension :

Type

Dimension along which antecedent and ellipsis site can differ (§5).

tense : MismatchDimension
modality : MismatchDimension
polarity : MismatchDimension
newWords : MismatchDimension
voice : MismatchDimension
argumentStructure : MismatchDimension

@[implicit_reducible]

instance AnandHardtMcCloskey2021.instDecidableEqMismatchDimension :

DecidableEq MismatchDimension

Equations

AnandHardtMcCloskey2021.instDecidableEqMismatchDimension x✝ y✝ = if h : x✝.ctorIdx = y✝.ctorIdx then isTrue ⋯ else isFalse ⋯

@[implicit_reducible]

instance AnandHardtMcCloskey2021.instReprMismatchDimension :

Repr MismatchDimension

Equations

AnandHardtMcCloskey2021.instReprMismatchDimension = { reprPrec := AnandHardtMcCloskey2021.instReprMismatchDimension.repr }

def AnandHardtMcCloskey2021.instReprMismatchDimension.repr :

MismatchDimension → Nat → Std.Format

Equations

One or more equations did not get rendered due to their size.

def AnandHardtMcCloskey2021.MismatchDimension.corpusCount :

MismatchDimension → Nat

Corpus count for each mismatch dimension (§5).

The two zero counts (voice, argument structure) record non-attestation in this 4,700-example corpus. The paper (§5.5) stresses this is consistent with the mismatches being very rare rather than impossible.
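
The rendered page does not show the equations for `corpusCount`, but the counts quoted under Key Findings are enough to sketch what such a definition could look like. This is a reconstruction under that assumption, placed in a hypothetical `Sketch` namespace to keep it apart from the real declarations; the actual Linglib definition may be written differently.

```lean
namespace Sketch

/-- The six §5 mismatch dimensions, as listed in the constructor table. -/
inductive MismatchDimension where
  | tense | modality | polarity | newWords | voice | argumentStructure
  deriving DecidableEq, Repr

/-- Reconstructed from the counts quoted under Key Findings;
    an assumption, not a copy of the Linglib source. -/
def MismatchDimension.corpusCount : MismatchDimension → Nat
  | .tense             => 129
  | .modality          => 394
  | .polarity          => 28
  | .newWords          => 71   -- "clear cases" only
  | .voice             => 0    -- unattested in this corpus
  | .argumentStructure => 0    -- unattested in this corpus

-- Modality outnumbers each of the other attested dimensions.
example : MismatchDimension.modality.corpusCount
    > MismatchDimension.tense.corpusCount := by decide

-- A zero count records non-attestation, nothing stronger.
example : MismatchDimension.voice.corpusCount = 0 := rfl

end Sketch
```

Encoding non-attestation as `0` keeps the function total over all six dimensions, which is what lets the later theorems about voice and argument structure be stated as plain equalities.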

Remnant semantic-type distribution (Table 1) #

The wh-remnant semantic-type distribution is a central descriptive contribution: prior work (Ross 1969, Chung et al. 1995, Merchant 2001) focused on Entity remnants, which are only 13.8% here, while Reason (typically why; 1,752 of 4,700, ≈37.3%) and Degree dominate.

inductive AnandHardtMcCloskey2021.RemnantSemanticType :

Type

Semantic type of the wh-remnant (Table 1).

reason : RemnantSemanticType
degree : RemnantSemanticType
entity : RemnantSemanticType
manner : RemnantSemanticType
temporal : RemnantSemanticType
locative : RemnantSemanticType
classificatory : RemnantSemanticType
other : RemnantSemanticType

@[implicit_reducible]

instance AnandHardtMcCloskey2021.instDecidableEqRemnantSemanticType :

DecidableEq RemnantSemanticType

Equations

AnandHardtMcCloskey2021.instDecidableEqRemnantSemanticType x✝ y✝ = if h : x✝.ctorIdx = y✝.ctorIdx then isTrue ⋯ else isFalse ⋯

@[implicit_reducible]

instance AnandHardtMcCloskey2021.instReprRemnantSemanticType :

Repr RemnantSemanticType

Equations

AnandHardtMcCloskey2021.instReprRemnantSemanticType = { reprPrec := AnandHardtMcCloskey2021.instReprRemnantSemanticType.repr }

def AnandHardtMcCloskey2021.instReprRemnantSemanticType.repr :

RemnantSemanticType → Nat → Std.Format

Equations

One or more equations did not get rendered due to their size.

def AnandHardtMcCloskey2021.RemnantSemanticType.embeddedCount :

RemnantSemanticType → Nat

Count of embedded sluices of each remnant type (Table 1, EMBEDDED column).

def AnandHardtMcCloskey2021.RemnantSemanticType.rootCount :

RemnantSemanticType → Nat

Count of root sluices of each remnant type (Table 1, ROOT column).

def AnandHardtMcCloskey2021.RemnantSemanticType.totalCount (t : RemnantSemanticType) :

Nat

Total count of each remnant type (Table 1, TOTAL column), derived as the sum of the embedded and root counts.

Equations

t.totalCount = t.embeddedCount + t.rootCount

Corpus summary statistics #

Percentages are stored as integer tenths (655 = 65.5%) since every downstream fact about them is an integer (in)equality; this preserves the paper's reported precision without rational arithmetic.

structure AnandHardtMcCloskey2021.CorpusSummary :

Type

Summary statistics from the Santa Cruz sluicing data set.

totalSluices : Nat
sproutingPctTenths : Nat
mergerPctTenths : Nat
rootPctTenths : Nat
embeddedPctTenths : Nat
antecedentlessCount : Nat

def AnandHardtMcCloskey2021.instReprCorpusSummary.repr :

CorpusSummary → Nat → Std.Format

Equations

One or more equations did not get rendered due to their size.

@[implicit_reducible]

instance AnandHardtMcCloskey2021.instReprCorpusSummary :

Repr CorpusSummary

Equations

AnandHardtMcCloskey2021.instReprCorpusSummary = { reprPrec := AnandHardtMcCloskey2021.instReprCorpusSummary.repr }

def AnandHardtMcCloskey2021.dataSet :

CorpusSummary

The data-set summary.

Equations

AnandHardtMcCloskey2021.dataSet = { totalSluices := 4700, sproutingPctTenths := 655, mergerPctTenths := 345, rootPctTenths := 276, embeddedPctTenths := 724, antecedentlessCount := 167 }
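
The integer-tenths encoding and the summary record above can be sketched in a few lines of Lean 4. The field names and values are taken from this page; the `Sketch` namespace, the `pctString` display helper, and the use of `decide` as the proof method are assumptions made for illustration, not the Linglib implementation.

```lean
namespace Sketch

/-- Mirror of the documented `CorpusSummary` fields. -/
structure CorpusSummary where
  totalSluices        : Nat
  sproutingPctTenths  : Nat
  mergerPctTenths     : Nat
  rootPctTenths       : Nat
  embeddedPctTenths   : Nat
  antecedentlessCount : Nat
  deriving Repr

def dataSet : CorpusSummary :=
  { totalSluices := 4700, sproutingPctTenths := 655, mergerPctTenths := 345,
    rootPctTenths := 276, embeddedPctTenths := 724, antecedentlessCount := 167 }

/-- Hypothetical display helper: render tenths of a percent as text. -/
def pctString (tenths : Nat) : String :=
  s!"{tenths / 10}.{tenths % 10}%"

-- With tenths stored as `Nat`, the corpus-level facts are integer (in)equalities.
example : dataSet.sproutingPctTenths > 500 := by decide
example : dataSet.sproutingPctTenths + dataSet.mergerPctTenths = 1000 := by decide
example : dataSet.rootPctTenths + dataSet.embeddedPctTenths = 1000 := by decide

#eval pctString dataSet.sproutingPctTenths  -- "65.5%"

end Sketch
```

Storing 655 rather than 65.5 means no rational or floating-point type is needed, and every check stays decidable by plain evaluation.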

Corpus distribution #

theorem AnandHardtMcCloskey2021.sprouting_is_majority_kind :

dataSet.sproutingPctTenths > 500

Sprouting is the majority sluice kind (overturning the literature's focus on merger).

theorem AnandHardtMcCloskey2021.embedded_is_majority_context :

dataSet.embeddedPctTenths > 500

Embedding is the majority context: 72.4% of sluices are embedded.

theorem AnandHardtMcCloskey2021.sprouting_merger_exhaustive :

dataSet.sproutingPctTenths + dataSet.mergerPctTenths = 1000

The reported sprouting/merger percentages are mutually consistent (sum to 100%).

theorem AnandHardtMcCloskey2021.root_embedded_exhaustive :

dataSet.rootPctTenths + dataSet.embeddedPctTenths = 1000

The reported root/embedded percentages are mutually consistent (sum to 100%).

theorem AnandHardtMcCloskey2021.semantic_type_totals_exhaustive :

RemnantSemanticType.reason.totalCount + RemnantSemanticType.degree.totalCount + RemnantSemanticType.entity.totalCount + RemnantSemanticType.manner.totalCount + RemnantSemanticType.temporal.totalCount + RemnantSemanticType.locative.totalCount + RemnantSemanticType.classificatory.totalCount + RemnantSemanticType.other.totalCount = dataSet.totalSluices

The Table 1 semantic-type totals partition the full corpus.

theorem AnandHardtMcCloskey2021.reason_not_majority :

2 * RemnantSemanticType.reason.totalCount < dataSet.totalSluices

Reason is the largest remnant type but NOT a majority of all sluices: at 1,752/4,700 (≈37.3%), twice the Reason count still falls short of the total. This is the corrected form of the paper's "why dominates" finding — why is the single most frequent remnant type, not a majority.
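
The doubling in the statement is a way of saying "less than half" without division. A small Lean 4 check, using only the counts quoted on this page and assuming `decide` as the proof method, shows the test rejecting Reason and accepting a genuine majority:

```lean
-- "Reason is a majority" would need 2 * 1752 > 4700; the opposite holds.
example : 2 * 1752 < 4700 := by decide   -- 3504 < 4700

-- For contrast, the embedded count (3,404) does pass the same majority test.
example : 2 * 3404 > 4700 := by decide   -- 6808 > 4700
```

Stating the bound as `2 * r < n` keeps everything in `Nat`, so no rounding convention enters the claim.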

Attested vs. absent mismatches #

theorem AnandHardtMcCloskey2021.modality_most_frequent_mismatch :

MismatchDimension.modality.corpusCount > MismatchDimension.tense.corpusCount ∧ MismatchDimension.modality.corpusCount > MismatchDimension.polarity.corpusCount ∧ MismatchDimension.modality.corpusCount > MismatchDimension.newWords.corpusCount

Modality mismatches (394) are the most frequent mismatch type, ahead of tense (129), new-words (71), and polarity (28).

theorem AnandHardtMcCloskey2021.no_voice_mismatches :

MismatchDimension.voice.corpusCount = 0

Voice mismatches have zero attestations in the corpus (the paper notes this is consistent with rarity, not proven impossibility — §5.5).

theorem AnandHardtMcCloskey2021.no_argstructure_mismatches :

MismatchDimension.argumentStructure.corpusCount = 0

Argument-structure (transitivity) mismatches have zero attestations in the corpus (likewise consistent with rarity rather than impossibility — §5.5).