LinguisticExample schema #
Substrate types for typed linguistic example data, aligned with the CLDF Examples component (Cross-Linguistic Data Formats; Forkel, List, Greenhill, Bank, et al. — DLCE/MPI-EVA).
Per-paper example data lives in Linglib/Studies/{Phenom}/{AuthorYear}.lean,
inside a namespace Examples ... end block delimited by marker comments and
auto-generated from Data/Examples/{AuthorYear}.json by
scripts/gen_examples.py. Each LinguisticExample carries a stable ID
mirroring the source paper's example numbering, plus Leipzig-style gloss and
provenance via SourceRef.
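A hedged sketch of the generated shape (the path, marker text, and names below are illustrative, not the generator's actual output):

```lean
-- Hypothetical Linglib/Studies/Donkey/Geach1962.lean
-- ===== BEGIN generated examples (scripts/gen_examples.py) =====
namespace Examples

-- One `def` per JSON row would appear here, e.g.
--   def geach1962_donkey1 : Data.Examples.LinguisticExample :=
--     { id := "geach1962_donkey1", ... }

end Examples
-- ===== END generated examples =====
```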
This schema is the substrate; it does not import anything from Linglib/.
Consumers (Phenomena/, Studies/) import it. Per CLAUDE.md, Data/ is
substrate alongside Core/, Features/, etc.
Leipzig glossing conventions #
The gloss component of glossedTokens follows the Leipzig Glossing Rules
(Comrie, Haspelmath, & Bickel 2008; spec at
https://www.eva.mpg.de/lingua/pdf/Glossing-Rules.pdf):
- `-` separates affixes from stems and from one another (segmentable morpheme boundary)
- `.` separates two glosses corresponding to one form (fusion / portmanteau)
- `=` separates clitics from hosts
- SMALL CAPS for grammatical category labels (e.g., INDEF, 3SG, PST, REL, NOM); plain lowercase for lexical glosses (e.g., farmer, dog)
- 1/2/3 for person; SG/DU/PL for number; gender labels (M/F/N) when relevant; case labels (NOM/ACC/GEN/...)
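For instance, a hedged glossedTokens sketch exercising all three separators (the German sentence and its glosses are illustrative, not from the repo):

```lean
-- Illustrative only: German "Er hat's vom Bauern gekauft"
-- ("He bought it from the farmer"):
example : List (String × String) :=
  [ ("Er",      "3SG.M.NOM"),
    ("hat=s",   "have.3SG=3SG.N.ACC"),  -- `=` clitic boundary
    ("vom",     "from.DEF.DAT"),        -- `.` fusion (von + dem)
    ("Bauer-n", "farmer-DAT"),          -- `-` affix boundary
    ("gekauft", "buy.PTCP") ]
```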
Token-gloss alignment is structural: glossedTokens : List (String × String)
pairs each analyzed-word token directly with its gloss, eliminating the
class of JSON regressions where two parallel lists drift to different lengths.
The helpers surfaceTokens and glossLine recover each component when needed.
Out of scope (for now) #
The LinguisticExample schema as committed handles single-sentence felicity
data with morpheme-aligned gloss. It does NOT yet model:
- Multi-sentence discourse contexts (anaphora resolution, modal subordination)
- Multiple LFs per surface form (scope ambiguity, reconstruction)
- Structural / tree-based examples (where the bracketing is the data)
- Minimal-pair clusters (paradigms that only mean something together)
- Gradient acceptability beyond the 5-level enum
These extensions land when a consuming study needs them; the schema can grow optional fields without breaking existing rows.
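A minimal sketch of that growth path, using hypothetical stand-in structures: a field added with a default value leaves pre-existing literals elaborating unchanged.

```lean
structure RowV1 where
  id : String

-- The same shape after growth: the new field carries a default,
-- so rows written against RowV1's shape still elaborate.
structure RowV2 where
  id : String
  discourseSegments : List String := []  -- added later, defaulted

def oldRow : RowV2 := { id := "geach1962_donkey1" }  -- still compiles
```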
Glottolog 5.0 language identifier (e.g. "stan1293" for Standard English).
Type alias only; values are not validated against Glottolog at the type
level. Matches the convention in Data.PHOIBLE.Inventory.glottocode
and Data.WALS.Datapoint.iso.
Equations
- Data.Examples.Glottocode = String
Instances For
Acceptability / felicity judgment scale. Five levels matching the Schütze/Sprouse-style informal scale (✓, ?, ??, ?*, *).
Star-character constructors (*, ?) are avoided because * is a Lean
reserved token and ? collides with tactic syntax. The named cases below
cover the same judgment scale unambiguously.
Use .acceptable for clean grammatical/felicitous data; reserve
.ungrammatical for hard star judgments and .unacceptable for
pragmatic/felicity-failed data that isn't strictly ungrammatical.
Constructor order runs from best to worst (.acceptable first,
.ungrammatical last); the derived Ord makes "this paper rates X worse
than Y" theorems possible without an extra wrapper.
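A minimal sketch of such a theorem, assuming only the derived Ord (the statement is illustrative):

```lean
open Data.Examples in
-- "worse than" falls out of constructor order via the derived `Ord`.
example : compare Judgment.acceptable Judgment.ungrammatical = Ordering.lt := by
  decide
```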
- acceptable : Judgment
- marginal : Judgment
- questionable : Judgment
- unacceptable : Judgment
- ungrammatical : Judgment
Instances For
Equations
- Data.Examples.instDecidableEqJudgment x✝ y✝ = if h : x✝.ctorIdx = y✝.ctorIdx then isTrue ⋯ else isFalse ⋯
Equations
- Data.Examples.instBEqJudgment.beq x✝ y✝ = (x✝.ctorIdx == y✝.ctorIdx)
Instances For
Equations
- One or more equations did not get rendered due to their size.
Instances For
Equations
- Data.Examples.instReprJudgment = { reprPrec := Data.Examples.instReprJudgment.repr }
Equations
- Data.Examples.instOrdJudgment = { compare := Data.Examples.instOrdJudgment.ord }
Equations
- Data.Examples.instOrdJudgment.ord x✝ y✝ = compare x✝.ctorIdx y✝.ctorIdx
Instances For
Reference to a published source for an example. The two fields are kept
separate rather than concatenated so each can be grepped independently:
bibkey matches an entry in blog/data/references.bib; paperLabel
is the source paper's own example numbering (e.g. "(58a)", "ex 12",
"donkey-classic"). The CLDF convention encodes both as a single string
"bibkey[label]"; we use a struct here because typed Lean rewards
structural decomposition. A construction sketch follows the field list below.
- bibkey : String
- paperLabel : String
Instances For
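A construction sketch (the bibkey and label are hypothetical):

```lean
open Data.Examples in
def geachDonkey : SourceRef :=
  { bibkey := "geach-1962", paperLabel := "(1)" }

-- Each field stays independently greppable:
#eval geachDonkey.bibkey      -- "geach-1962"
#eval geachDonkey.paperLabel  -- "(1)"
```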
Equations
- One or more equations did not get rendered due to their size.
Instances For
Equations
- One or more equations did not get rendered due to their size.
Instances For
Equations
- Data.Examples.instReprSourceRef = { reprPrec := Data.Examples.instReprSourceRef.repr }
A single linguistic example with surface form, gloss, judgment, and
provenance. CLDF-aligned core fields with two optional CLDF fields
(metaLanguage, lgrConformance) carrying their CLDF defaults.
The id field is the canonical reference for theorems: a study file
proving theorem T_predicts_kratzer1991_58a makes the example ID part
of the public record, grep-able across the codebase.
Field semantics:
- id: stable, paper-keyed identifier (e.g. "charlow2014_donkey1").
- source: bibkey + paperLabel of the originating paper, i.e. where the example was first introduced or attested. For examples cited in subsequent literature, source records the original (e.g. Geach 1962 for the canonical donkey sentence).
- reportedIn: optional second SourceRef for the paper whose JSON file this example sits in, when that paper is not the originating source. none means the JSON's owning paper is also the originator. Use case: Schwarz 2013 cites Ebert 1971a's Fering example (8); the JSON row records source = ebert-1971a, reportedIn = some ⟨schwarz-2013, "(8)"⟩.
- language: Glottocode of the example's object language.
- primaryText: surface string in the object language. For multi-sentence discourse examples, this is the concatenation of discourseSegments (or empty if discourseSegments is non-empty and you prefer to read the segments directly).
- discourseSegments: list of utterance-level segments for multi-sentence examples (anaphora resolution, ellipsis, modal subordination, donkey discourses). The empty list [] means "single-sentence example; use primaryText". A non-empty list breaks the discourse into utterances, in utterance order. Hofmann 2025's (antecedentSentence, anaphorSentence) pattern is the canonical case.
- glossedTokens: list of (analyzed-word, gloss) pairs. The pair-list shape makes alignment hold by construction: no isAligned predicate is needed because misalignment is unrepresentable. Spans the full primaryText (or the full discourse if multi-sentence).
- translation: minimal English (or metaLanguage) translation.
- context: discourse / scenario context under which the judgment holds. An empty string means a context-free judgment (rare; most felicity judgments are context-relative).
- judgment: acceptability/felicity of primaryText on the Judgment scale. Sentence-level, not reading-level.
- alternatives: list of (form, judgment) pairs for within-example contrast: competing surface forms that could fill the same slot, each with its own judgment. Empty [] means "no alternatives recorded". Schwarz's "#vom / von dem Buch" pattern is the canonical case: primaryText carries the felicitous form; alternatives carries the infelicitous variants. Each form is a full alternative primaryText (the whole sentence with the substitution made), not just the substituted phrase, so it is interpretable in isolation.
- readings: list of (name, judgment) pairs for multiple LFs / interpretations of the same primaryText. Empty [] means "no reading-level annotations" (the example is unambiguous, or the analyst doesn't distinguish readings). Donkey examples (("strong/universal", .acceptable), ("weak/existential", .acceptable)) and scope-ambiguous examples are the canonical cases. Sentence acceptability (judgment) is independent of reading availability: a sentence can be .acceptable and still have one of two readings blocked.
- paperFeatures: list of (key, value) pairs for paper-specific analytical classifications of this example. This is the bottom layer of the paper-relativized claim model: a typological survey with a feature inventory (tam, ET_ST, evidence_type, etc.) attaches its classifications here as flat strings. The future richer form will be classifications : List (SourceRef × String × String), carrying which paper's framework each classification is under; for now the JSON-owning paper's framework is implicit. Empty [] means "no paper-specific tags". Different papers can use orthogonal feature sets; consumers project out whatever they need.
- comment: free-form notes (analyst caveats, methodological notes).
- metaLanguage: CLDF field; defaults to "stan1293" (English).
- lgrConformance: CLDF field; "" / "WORD_ALIGNED" / "MORPHEME_ALIGNED".

A hedged example row is sketched after the field list below.
- id : String
- source : SourceRef
- reportedIn : Option SourceRef
- language : Glottocode
- primaryText : String
- discourseSegments : List String
- glossedTokens : List (String × String)
- translation : String
- context : String
- judgment : Judgment
- alternatives : List (String × Judgment)
- readings : List (String × Judgment)
- paperFeatures : List (String × String)
- comment : String
- metaLanguage : Glottocode
- lgrConformance : String
Instances For
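A hedged sketch of a complete row, plus the kind of theorem that makes its ID grep-able (the ID, bibkey, glosses, and names are illustrative, not copied from any Data/Examples/*.json file):

```lean
open Data.Examples

-- Hypothetical row for the canonical donkey sentence.
def donkeyClassic : LinguisticExample :=
  { id := "geach1962_donkey1"
    source := { bibkey := "geach-1962", paperLabel := "(1)" }
    reportedIn := none
    language := "stan1293"
    primaryText := "Every farmer who owns a donkey beats it."
    discourseSegments := []
    glossedTokens :=
      [ ("Every", "every"), ("farmer", "farmer"), ("who", "REL"),
        ("own-s", "own-3SG"), ("a", "INDEF"), ("donkey", "donkey"),
        ("beat-s", "beat-3SG"), ("it", "3SG.N") ]
    translation := "Every farmer who owns a donkey beats it."
    context := ""
    judgment := .acceptable
    alternatives := []
    readings :=
      [ ("strong/universal", .acceptable), ("weak/existential", .acceptable) ]
    paperFeatures := []
    comment := ""
    metaLanguage := "stan1293"
    lgrConformance := "MORPHEME_ALIGNED" }

-- The def name and `id` stay grep-able from theorem names:
theorem predicts_geach1962_donkey1 :
    donkeyClassic.judgment = Judgment.acceptable := rfl
```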
Equations
- One or more equations did not get rendered due to their size.
Instances For
Equations
- One or more equations did not get rendered due to their size.
Instances For
Surface-form tokens (analyzed-word side of glossedTokens).
Equations
- e.surfaceTokens = List.map Prod.fst e.glossedTokens
Instances For
Gloss tokens (Leipzig-conventions side of glossedTokens).
Equations
- e.glossLine = List.map Prod.snd e.glossedTokens
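Reusing the hypothetical donkeyClassic row sketched above, the two helpers recover each line:

```lean
#eval donkeyClassic.surfaceTokens
-- ["Every", "farmer", "who", "own-s", "a", "donkey", "beat-s", "it"]
#eval donkeyClassic.glossLine
-- ["every", "farmer", "REL", "own-3SG", "INDEF", "donkey", "beat-3SG", "3SG.N"]
```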