Documentation

Linglib.Phenomena.WordOrder.Studies.Shieber1985

Shieber (1985) @cite{shieber-1985} #

Evidence against the Context-Freeness of Natural Language. Linguistics and Philosophy, 8(3), 333–343.

Core Argument #

@cite{shieber-1985} proves that Swiss German is not weakly context-free, using a purely string-based argument that makes no assumptions about constituent structure or semantics. The proof rests on four empirical claims about Swiss German subordinate clauses, plus the closure of context-free languages under string homomorphism and intersection with regular languages (proven in Linglib.Core.Computability.ContextFreeGrammar.{Map, InterRegular} following @cite{bar-hillel-perles-shamir-1961} / @cite{hopcroft-motwani-ullman-2000} Theorems 7.24, 7.27).

The Four Claims #

  1. Swiss German subordinate clauses have structures where all Vs follow all NPs.
  2. Among such sentences, those with all DAT-NPs before all ACC-NPs, and all DAT-Vs before all ACC-Vs, exist as a regular-intersection-filterable subset. This is not a grammaticality condition on Swiss German itself — SG freely allows interleaved NP orderings — but the case-sorted subset is what Shieber's homomorphism + regular intersection isolates.
  3. The number of DAT-Vs equals the number of DAT-NPs (and similarly for ACC).
  4. An arbitrary number of Vs can occur (subject to performance).

What Is Formalized Here #

swiss_german_not_contextFree proves: Swiss German subordinate clauses are not weakly context-free. The language swissGermanLang allows any token sequence whose cross-serial core (after stripping matrix boundary tokens — matrix subjects, complementizers, auxiliaries, finite verbs of the embedding clause) has shape nps ++ vs with case counts matched. Matrix tokens may appear interleaved anywhere; Shieber's homomorphism erases them.

The proof exercises both legs of the Bar-Hillel schema (Language.not_isContextFree_via_witness from Closure.lean):

The witness language stringMap h L ⊓ R = ambncmdn — non-CF by ambncmdn_not_contextFree (two-parameter pumping). Per the Bar-Hillel schema, source SG is non-CF.

Proof bottoms out at the two CFL closure theorems in Linglib.Core.Computability.ContextFreeGrammar.{Map, InterRegular} (homomorphism + intersection-with-regular, both proven; see those files for @cite{bar-hillel-perles-shamir-1961} / @cite{hopcroft-motwani-ullman-2000} construction details) and ambncmdn_not_contextFree (the two-parameter pumping result) in Linglib.Core.Computability.NonContextFree.

A weaker pedagogical waypoint, swiss_german_diagonal_not_contextFree, uses only the simpler one-parameter anbncndn substrate; it covers just the diagonal (m = n) subset of SG clauses with no matrix material.

Remaining idealizations #

The matrix token is a single abstract constructor, not a richer model of SG matrix material's internal structure (specific lexical items, agreement, finite-verb fronting). For Shieber's purpose — showing non-CFness of the string set — this abstraction is faithful: he too treats matrix material as an erasable substring under the homomorphism. A richer model that ties matrix-token positions to specific SG fragment data would be a substantive extension; the current formalization captures Shieber's full argument without it.

Contrast with @cite{bresnan-etal-1982} #

@cite{bresnan-etal-1982}'s earlier argument for Dutch non-context-freeness relied on linguistic assumptions about constituent structure, which @cite{gazdar-pullum-1982} contested. @cite{shieber-1985}'s argument is purely formal — it rests entirely on the string set of Swiss German and the case-marking facts, making no claims about phrase structure.

Subsequent literature #

The processing-side dissociation (cross-serial easier than nested despite greater formal complexity) is the subject of @cite{bach-brown-marslen-wilson-1986}, formalized in Phenomena.WordOrder.Studies.BachBrownMarslenWilson1986 — not duplicated here.

A Swiss German subordinate clause token, abstracting over specific lexical items to their role in the cross-serial construction.

@cite{shieber-1985}'s argument projects SG sentences to four case-marked classes (DAT-NP / ACC-NP / DAT-V / ACC-V) plus boundary material. The matrix constructor abstracts non-cross-serial tokens — matrix subjects, complementizers, auxiliaries, finite verbs of the embedding clause, etc. Shieber's homomorphism erases these to ε; our tokenStringHom does the same by mapping matrix to []. This is what makes the proof apply to full Swiss German rather than just the NPV sub-shape — matrix tokens can appear interleaved anywhere in a sentence, and the homomorphism strips them, so the regex-filtered image picks up the cross-serial core regardless of how it's wrapped in the source string.

  • datNP : Token

    Dative NP (e.g., em Hans)

  • accNP : Token

    Accusative NP (e.g., d'chind, de Hans)

  • datV : Token

    Dative-subcategorizing verb (e.g., hälfe "help")

  • accV : Token

    Accusative-subcategorizing verb (e.g., lönd "let", aastriiche "paint")

  • matrix : Token

    Boundary material: matrix subject, complementizer, auxiliary, finite verb of the embedding clause, etc. Erased by tokenStringHom (Shieber's projection).

Instances For
    @[implicit_reducible]
    Equations
    def Shieber1985.instReprToken.repr :
    TokenStd.Format
    Equations
    Instances For
      @[implicit_reducible]
      Equations

      The case that a token bears or requires. Matrix tokens have no case; the field is total at .dat as a default (never consulted on matrix tokens in any consumer — case-matching predicates filter via isNP / isV first).

      Equations
      Instances For

        Whether a token is a case-marked NP (vs a verb or matrix material).

        Equations
        Instances For

          Whether a token is a case-subcategorizing verb (vs an NP or matrix material).

          Equations
          Instances For

            Whether a token is matrix (boundary) material (vs a cross-serial NP/V).

            Equations
            Instances For

              The abstract shape of a case-sorted cross-serial clause: counts of each (case × NP/V) combination, in the canonical order DAT-NPs, ACC-NPs, DAT-Vs, ACC-Vs.

              This is not a grammaticality condition on Swiss German — SG freely allows interleaved NP orderings — but the case-sorted shape is what Shieber's homomorphism + regular intersection isolates from the full SG language.

              • datNPs :
              • accNPs :
              • datVs :
              • accVs :
              Instances For
                def Shieber1985.instDecidableEqCrossSerialClause.decEq (x✝ x✝¹ : CrossSerialClause) :
                Decidable (x✝ = x✝¹)
                Equations
                • One or more equations did not get rendered due to their size.
                Instances For
                  Equations
                  • One or more equations did not get rendered due to their size.
                  Instances For

                    Claim 3: case matching — the number of dative verbs equals the number of dative NPs, and similarly for accusative.

                    Equations
                    Instances For

                      A case-sorted clause that satisfies case matching. Renamed from the misleading GrammaticalClause — case matching is part of grammaticality, but case-sorting is not (per Claim 2 framing above).

                      Instances For

                        Claim 4: any combination of dative and accusative verb counts can occur (we can produce a CaseMatchedClause for any m, n).

                        Equations
                        Instances For

                          Shieber's homomorphism f: maps Swiss German cross-serial clause tokens to the abstract alphabet {a, b, c, d}. Matrix tokens have no meaningful image; the placeholder .a is never observed by any consumer (every use site filters via isMatrix or constructs from non-matrix tokens). The real homomorphism tokenStringHom ERASES matrix tokens to [].

                          Equations
                          Instances For

                            Shieber's homomorphism, lifted to lists. Cross-serial NPs and Vs map to singleton {a,b,c,d} letters; matrix tokens are erased. This []-valued case is essential for full-SG fidelity: it lets the schema apply to sentences with arbitrary boundary material wrapping the cross-serial core. The function is no longer a StringHom.letterMap for that reason.

                            Equations
                            Instances For

                              The token sequence corresponding to a case-matched clause: NPs first (DAT then ACC), then Vs (DAT then ACC).

                              Equations
                              • One or more equations did not get rendered due to their size.
                              Instances For

                                The string image of a case-matched clause's token sequence under tokenStringHom: a string in {a,b,c,d}*.

                                Equations
                                Instances For

                                  The image of a case-matched clause is the StringHom.apply of tokenStringHom on its case-sorted token sequence.

                                  theorem Shieber1985.clauseImage_shape (m n : ) :
                                  clauseImage (arbitraryDepth m n) = List.replicate m FourSymbol.a ++ List.replicate n FourSymbol.b ++ List.replicate m FourSymbol.c ++ List.replicate n FourSymbol.d

                                  A case-matched clause with m DAT-pairs and n ACC-pairs maps to aᵐ bⁿ cᵐ dⁿ.

                                  Setting m = n in the clause image gives aⁿbⁿcⁿdⁿ.

                                  The diagonal clause images are in {aⁿbⁿcⁿdⁿ}.

                                  The diagonal subset of Swiss German cross-serial clauses (encoded as token sequences): case-matched, case-sorted, with m = n.

                                  Equations
                                  Instances For

                                    The homomorphic image of the diagonal SG language under tokenStringHom is exactly anbncndn.

                                    This is the load-bearing equality for Shieber's argument: it sits at the interface between linguistic data (the diagonal SG clauses) and the formal-language witness (anbncndn) whose non-CF status is established in Linglib.Core.Computability.NonContextFree.

                                    Diagonal Swiss German is not context-free. A formal corollary of @cite{shieber-1985}'s argument, using only the one-parameter substrate {aⁿbⁿcⁿdⁿ}. The image of the diagonal SG language under tokenStringHom equals anbncndn (not CF), so by closure under string homomorphism (Linglib.Core.Computability.ContextFreeGrammar.Closure), the source is not CF.

                                    The Swiss German cross-serial language. A token list ts is in this language iff:

                                    • stripping matrix tokens leaves a list of form nps ++ vs (NPs precede Vs in the cross-serial core), and
                                    • the case counts match (#DAT-NP = #DAT-V, #ACC-NP = #ACC-V).

                                    Matrix tokens (subjects, complementizers, auxiliaries) may appear anywhere interleaved with cross-serial NPs/Vs — Shieber's homomorphism erases them. The case-sorted canonical form caseSortedTokens (arbitraryDepth m n) (no matrix tokens, NPs and Vs each in canonical case order) is properly contained.

                                    Equations
                                    • One or more equations did not get rendered due to their size.
                                    Instances For

                                      Every case-matched, case-sorted clause (no matrix material) is in the (more general) free SG language.

                                      DFA states for the case-sorted shape a*b*c*d*.

                                      Instances For
                                        @[implicit_reducible]
                                        Equations
                                        @[implicit_reducible]
                                        Equations
                                        • One or more equations did not get rendered due to their size.
                                        Equations
                                        • One or more equations did not get rendered due to their size.
                                        Instances For

                                          DFA recognizing a*b*c*d* over FourSymbol.

                                          Equations
                                          • One or more equations did not get rendered due to their size.
                                          Instances For

                                            The case-sorted shape language a*b*c*d* on FourSymbol.

                                            Equations
                                            Instances For

                                              Concrete witness: every aᵐbⁿcᵐdⁿ is case-sorted.

                                              theorem Shieber1985.caseSorted_decomp (xs : List FourSymbol) (h : xs caseSorted) :
                                              ∃ (p : ) (q : ) (r : ) (u : ), xs = List.replicate p FourSymbol.a ++ List.replicate q FourSymbol.b ++ List.replicate r FourSymbol.c ++ List.replicate u FourSymbol.d

                                              Any case-sorted FourString decomposes uniquely into block-sorted aᵖ ++ bᵠ ++ cʳ ++ dˢ.

                                              The Bar-Hillel-style intersection equality: full SG (with arbitrary matrix material), projected to {a,b,c,d} and filtered to case-sorted shape, equals exactly the non-CF language aᵐbⁿcᵐdⁿ.

                                              @cite{shieber-1985}'s main theorem. Swiss German subordinate clauses are not weakly context-free.

                                              Proof structure (the actual Shieber schema, both legs exercised): apply Language.not_isContextFree_via_witness with the tokenStringHom homomorphism + caseSorted regular filter; the witness is that stringMap h L ⊓ R = ambncmdn, which is non-CF by ambncmdn_not_contextFree (two-parameter pumping). The R-leg matters: swissGermanLang allows interleaved NP/V orderings, and the regular filter forces the case-sorted canonical shape needed to land in ambncmdn.

                                              Example (1): mer em Hans es huus hälfed aastriiche "we helped Hans paint the house"

                                              em Hans (DAT) → hälfed (DAT-verb "helped"); es huus (ACC) → aastriiche (ACC-verb "paint").

                                              Equations
                                              Instances For

                                                Example (5): triply embedded cross-serial clause mer d'chind em Hans es huus lönd hälfe aastriiche "we let the children help Hans paint the house"

                                                d'chind (ACC) → lönd (ACC-verb "let"); em Hans (DAT) → hälfe (DAT-verb "help"); es huus (ACC) → aastriiche (ACC-verb "paint").

                                                With case sorting: 1 DAT-NP, 2 ACC-NPs, 1 DAT-V, 2 ACC-Vs.

                                                Equations
                                                Instances For

                                                  The homomorphic image of example (5) is abbcdd = a¹b²c¹d².