Linglib.Studies.AnandHardtMcCloskey2021

[AHMcC21]: The Santa Cruz sluicing data set

Distributional findings from the Santa Cruz sluicing data set, a 4,700-example annotated corpus of naturally occurring English sluices drawn from the New York Times subset of the Gigaword corpus, built by the Santa Cruz Ellipsis Project (SCEP).

Key Findings

  1. Sprouting dominates: 65.5% of sluices have no overt correlate, which runs counter to the theoretical literature's focus on merger (34.5%).

  2. why / Reason is the largest remnant type, not a majority. why sluices are 53.8% of sprouting and 37.2% of all sluices; the Reason semantic type (Table 1) totals 1,752/4,700 (≈37.3%) — the single most frequent remnant type, but well short of a majority of all sluices. (The "53.8% of the total" figure holds only in the paper's footnote variant that folds in why not questions, which the authors caution "should probably not be analyzed in the same terms as sluicing.")

  3. Mismatches are systematic but bounded: tense (129), modality (394), polarity (28), and new-words (71 clear cases) mismatches between antecedent and ellipsis site are attested. Voice and argument-structure (transitivity) mismatches have zero attestations — which the paper is careful to note is consistent with great rarity, not demonstrated impossibility.

  4. Embedding is the norm: 72.4% of sluices are embedded (3,404), vs. 27.6% root (1,296).

The paper is a descriptive data-set report; it explicitly defers theoretical interpretation to a companion paper. The Syntactic-Isomorphism-Condition bridge theorems that relate this corpus to a formal-matching account therefore live in Studies/AnandHardtMcCloskey2025.lean, not here (chronological grounding).

Antecedent–ellipsis mismatch dimensions (§5)

The data set documents which form/interpretation mismatches between antecedent and ellipsis site are attested and which are absent. Attested mismatches challenge strict syntactic-identity requirements; the absent ones (especially argument structure) constrain theories of ellipsis licensing. These six dimensions are the §5 mismatch taxonomy — distinct from the paper's broader annotation ontology (antecedent/wh-remnant/paraphrase/correlate obligatory tags; the E-TYPE, IGNORE, ISLAND, MISSINGANTE, QEMBEDDER global tags), which is not modelled here.

Dimension along which antecedent and ellipsis site can differ (§5).

Corpus count for each mismatch dimension (§5).

The two zero counts (voice, argument structure) record non-attestation in this 4,700-example corpus. The paper (§5.5) stresses this is consistent with the mismatches being very rare rather than impossible.
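
The rendered page does not show the declarations behind these docstrings, so the following is only a minimal sketch of how the six §5 dimensions and their counts could be encoded. The names `MismatchDim`, `count`, and the constructor names are placeholders chosen for illustration, not the module's actual identifiers; the numbers are the ones reported in the Key Findings.

```lean
namespace Sketch

/-- The six §5 mismatch dimensions (all names here are placeholders). -/
inductive MismatchDim where
  | tense | modality | polarity | newWords | voice | argStructure
  deriving DecidableEq, Repr

/-- Corpus counts as reported in §5; `newWords` covers clear cases only. -/
def MismatchDim.count : MismatchDim → Nat
  | .tense => 129
  | .modality => 394
  | .polarity => 28
  | .newWords => 71
  | .voice => 0
  | .argStructure => 0

-- A zero count records non-attestation in this corpus, not impossibility.
example : MismatchDim.voice.count = 0 := rfl
example : MismatchDim.argStructure.count = 0 := rfl

-- Modality mismatches are more frequent than tense mismatches.
example : MismatchDim.tense.count < MismatchDim.modality.count := by decide

end Sketch
```

Keeping the counts as a total function on an enumeration means every dimension must be given a value, including the two unattested ones, so the zeros are stated explicitly rather than left out.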

Remnant semantic-type distribution (Table 1)

The wh-remnant semantic-type distribution is a central descriptive contribution: prior work (Ross 1969, Chung et al. 1995, Merchant 2001) focused on Entity remnants, which are only 13.8% here, while Reason (typically why) and Degree dominate.

Semantic type of the wh-remnant (Table 1).

Total count of each remnant type (Table 1, TOTAL column), derived.

Corpus summary statistics

Percentages are stored as integer tenths (655 = 65.5%) since every downstream fact about them is an integer (in)equality; this preserves the paper's reported precision without rational arithmetic.
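
To illustrate the integer-tenths encoding, the sketch below rounds a raw count to tenths of a percent and checks it against the embedded/root figures from the Key Findings (3,404 and 1,296 out of 4,700). The helper `pctTenths` and its round-half-up rule are assumptions made for this example; the module itself stores the paper's reported percentages and may not compute them from counts at all.

```lean
/-- Hypothetical helper: `n` out of `total`, in tenths of a percent,
    rounded to the nearest tenth (halves round up). -/
def pctTenths (n total : Nat) : Nat :=
  (1000 * n + total / 2) / total

-- Embedded sluices: 3,404 of 4,700 rounds to 72.4%.
example : pctTenths 3404 4700 = 724 := by decide

-- Root sluices: 1,296 of 4,700 rounds to 27.6%.
example : pctTenths 1296 4700 = 276 := by decide

-- The two raw counts partition the corpus.
example : 3404 + 1296 = 4700 := by decide
```

Because everything stays in `Nat`, each check is a closed decidable proposition and needs no rational or real arithmetic.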

Summary statistics from the Santa Cruz sluicing data set.

• totalSluices : Nat
• sproutingPctTenths : Nat
• mergerPctTenths : Nat
• rootPctTenths : Nat
• embeddedPctTenths : Nat
• antecedentlessCount : Nat
The data-set summary.

Equations
• AnandHardtMcCloskey2021.dataSet = { totalSluices := 4700, sproutingPctTenths := 655, mergerPctTenths := 345, rootPctTenths := 276, embeddedPctTenths := 724, antecedentlessCount := 167 }
Corpus distribution

Sprouting is the majority sluice kind (contrary to the literature's focus on merger).

The reported sprouting/merger percentages are mutually consistent (sum to 100%).

The reported root/embedded percentages are mutually consistent (sum to 100%).

Reason is the largest remnant type but not a majority of all sluices: at 1,752/4,700 (≈37.3%), twice the Reason count still falls short of the total (2 × 1,752 = 3,504 < 4,700). This is the corrected form of the paper's "why dominates" finding: why is the single most frequent remnant type, not a majority.
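
The theorem statements themselves are not rendered on this page, so the following sketch only suggests how the four facts above could be stated and discharged. The field names and values follow the documented summary structure; the structure name `CorpusStats`, the constant `reasonTotal`, and the exact form of each statement are assumptions for illustration.

```lean
namespace Sketch

/-- Summary statistics; `CorpusStats` is a placeholder name. -/
structure CorpusStats where
  totalSluices : Nat
  sproutingPctTenths : Nat
  mergerPctTenths : Nat
  rootPctTenths : Nat
  embeddedPctTenths : Nat
  antecedentlessCount : Nat

def dataSet : CorpusStats :=
  { totalSluices := 4700, sproutingPctTenths := 655, mergerPctTenths := 345,
    rootPctTenths := 276, embeddedPctTenths := 724, antecedentlessCount := 167 }

/-- Reason total from Table 1 (TOTAL column); the name is a placeholder. -/
def reasonTotal : Nat := 1752

-- Sprouting outnumbers merger.
example : dataSet.mergerPctTenths < dataSet.sproutingPctTenths := by decide

-- Each pair of percentages sums to 100.0%, i.e. 1000 tenths.
example : dataSet.sproutingPctTenths + dataSet.mergerPctTenths = 1000 := by decide
example : dataSet.rootPctTenths + dataSet.embeddedPctTenths = 1000 := by decide

-- Reason is not a majority: twice its count is still below the total.
example : 2 * reasonTotal < dataSet.totalSluices := by decide

end Sketch
```

Stating "not a majority" as `2 * reasonTotal < totalSluices` avoids division entirely, which fits the integer-only design described under the corpus summary statistics.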

Attested vs. absent mismatches

Voice mismatches have zero attestations in the corpus (the paper notes this is consistent with rarity, not proven impossibility; §5.5).

Argument-structure (transitivity) mismatches have zero attestations in the corpus (likewise consistent with rarity rather than impossibility; §5.5).